Abstract A large and diverse set of measurements is regularly collected during a patient’s
hospital stay to monitor their health status. Tools for integrating these measurements into
severity scores that accurately track changes in illness severity can improve clinicians’
ability to provide timely interventions. Existing approaches for creating such scores either 1)
rely on experts to fully specify the severity score, 2) infer a score using detailed models of
disease progression, or 3) train a predictive score, using supervised learning, by regressing
against a surrogate marker of severity such as the presence of downstream adverse events.
The first approach does not extend to diseases where an accurate score cannot be elicited
from experts. The second assumes that the progression of disease can be accurately modeled,
limiting its application to populations with simple, well-understood disease dynamics.
The third approach, which is also the most commonly used, often produces scores that suffer
from bias due to treatment-related censoring (Paxton et al, 2013). Specifically, since the
downstream outcomes used for training are observed only noisily and are influenced by treatment
administration patterns, these scores do not generalize well when treatment administration
patterns change. We propose a novel ranking-based framework for disease severity score
learning (DSSL). DSSL exploits the following key observation: while it is challenging for
experts to quantify the disease severity at any given time, it is often easy to compare the
disease severity at two different times. Extending existing ranking algorithms, DSSL learns
a function that maps a vector of a patient’s measurements to a scalar severity score subject to
two constraints. First, the resulting score should be consistent with the expert’s ranking of
the disease severity state. Second, changes in score between consecutive periods should be
smooth. We apply DSSL to the problem of learning a sepsis severity score using a large,
real-world electronic health record dataset. The learned scores significantly outperform
state-of-the-art clinical scores in ranking patient states by severity and in early detection of
downstream adverse events. We also show that the learned disease severity trajectories are
consistent with clinical expectations of disease evolution. Further, we simulate datasets
containing different treatment administration patterns and show that DSSL generalizes better
to changes in treatment patterns than the above approaches.


1 Introduction 

Consider the task of monitoring patients admitted to the Intensive Care Unit (ICU). Clinicians
must regularly assess for changes in disease severity to plan timely interventions. Since
direct observation of a patient’s disease state is rarely possible, assessing severity requires
the caregiver to interpret a diverse array of markers (e.g., heart rate, respiratory rate, blood
counts, and serum measurements) that measure the underlying physiologic and metabolic
state. In Figure 1, we show a subset of such data collected on a single patient in the intensive
care unit over the 48-hour period preceding septic shock. Continuous assessment of whether
an individual is at risk based on these data is both time-consuming and challenging.


[Figure 1: time series of raw and derived features for an example patient]

Fig. 1 Measurements over time for an example patient in an intensive care unit (ICU). In blue, we identify
the feature vector $x^p$ at time t = 33 hours, i.e., all available measurements for a patient p at time t.

In this paper, we address the problem of quantifying (scoring) the latent severity of an
individual’s disease at a given time. That is, we derive a mapping from the high-dimensional
observed marker data to a numeric score that tracks changes in severity of the underlying
disease state over time: as health worsens, the score increases, and as the individual’s
health improves, the score declines. Accurate estimation and tracking of the underlying
disease severity can enable clinicians to detect critical declines, such as decompensations and
acute adverse events, in a timely manner. Additional benefits of accurate disease severity
estimation include a means for measuring an individual’s response to therapy and stratification
of patients for resource management and clinical research (Keegan et al, 2011).

Defining a Disease Severity Score: Qualitatively, the concept of a disease severity score
has been described as the total effect of disease on the body; the irreversible effect is
referred to as damage, while the reversible component is referred to as activity (Medsger et al,
2003). The precise interpretation of the concepts of damage and activity is typically based on
the application at hand. Desirable properties of a severity scale include: 1) face and content
validity, i.e., the variables included are important and clinically credible, and 2) construct
validity, i.e., the scoring system parallels an independently ascertained severity measurement
(Medsger et al, 2003).

Prior Art: Historically, severity scores have been designed in a number of different ways
(Ghanem-Zoubi et al, 2011). One approach is to have clinical experts fully specify the score.
Namely, using existing clinical literature, a panel of experts identifies factors that are most
indicative of severity of the target disease. These factors are weighted by their relative
contribution to the severity and summed together to yield the total resulting score. For example,
the Acute Physiology and Chronic Health Evaluation score (APACHE II; Knaus et al, 1985),
which assesses the overall health state in an in-patient setting, uses factors that are most
predictive of mortality. A heart rate between 110 and 139 beats per minute adds 2 points
to the final score while a heart rate higher than 180 beats per minute adds 4 points.
Similarly, mean arterial blood pressure between 70 and 109 mm Hg adds no points while a
value between 50 and 69 mm Hg adds 2 points. A number of additional widely used scoring
systems have been designed in this way, including the Multiple Organ Dysfunction Score
(MODS; Marshall et al, 1995), the Sequential Organ Failure Assessment (SOFA; Vincent et al,
1996), and Medsger’s scoring system (Medsger et al, 2003).
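The additive structure of an expert-specified score can be sketched as follows. This is a minimal illustration that encodes only the heart rate and blood pressure ranges quoted above; the function names are hypothetical, and values outside the quoted ranges return None because the text does not give their points.

```python
def heart_rate_points(bpm):
    """Points for the heart-rate ranges quoted in the text; None elsewhere."""
    if 110 <= bpm <= 139:
        return 2
    if bpm > 180:
        return 4
    return None  # range not quoted in the text


def mean_arterial_pressure_points(mm_hg):
    """Points for the blood-pressure ranges quoted in the text; None elsewhere."""
    if 70 <= mm_hg <= 109:
        return 0
    if 50 <= mm_hg <= 69:
        return 2
    return None  # range not quoted in the text


def partial_score(bpm, mm_hg):
    """Sum the per-factor points, as an expert-specified score would."""
    points = [heart_rate_points(bpm), mean_arterial_pressure_points(mm_hg)]
    if any(p is None for p in points):
        return None
    return sum(points)
```

A full score of this kind sums many more factors; the sketch only shows how fixed, expert-chosen ranges translate into points.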

A second approach commonly taken is to assume that the severity can be characterized
in terms of another surrogate measure such as the risk of an impending adverse event or
mortality. This method relies on the intuition that high severity states are more likely to
be associated with adverse events and higher mortality rates. The disease severity score is
then learned by regressing the risk on observed biomarkers and elements of clinical
history. For instance, the pneumonia severity index (PSI) combines 19 factors
including age, vitals and laboratory test results, to calculate the probability of morbidity
and mortality among patients with community acquired pneumonia (Fine et al, 1997). The
relative weight of each factor in the resulting score was derived by training a logistic
regression predictor of the patient’s death in the following 30-day window. For simplicity of use,
the relative weights were normalized so that the weight of the age would be equal to one
and rounded up to the closest multiple of 10 (of 15 for temperature). Others have similarly
used downstream adverse events such as the development of Clostridium difficile infection
(Wiens et al, 2012), septic shock (Ho et al, 2012), morbidity (Saria et al, 2010b), and
mortality (Pirracchio et al, 2015) as surrogate sources of supervision for training severity scores.

A third approach uses probabilistic state estimation techniques to track disease severity
and progression (e.g., Mould, 2012; Jackson et al, 2003; Saria et al, 2010a; Wang et al,
2014). These model disease progression as a function of the observed measurements. For
example, Jackson et al (2003) study abdominal aortic aneurysms in elderly men. They divide
the progression of this disease into discrete stages of increasing severity according to
successive ranges of aortic diameter. The disease dynamics are modeled using a hidden Markov
model (HMM), which captures both the transitions between the stages and the stage
misclassification probability. The parameters of this model are estimated using maximum
likelihood. Once the model parameters are known, the disease severity of a patient (the
unobserved state of the HMM) at a given time can be obtained by inference on the learned model.

However, all of the above-mentioned approaches for derivation of disease severity scores
have their limitations. The expert-based approach captures known clinical expertise well,
but does not extend to populations where the current clinical knowledge is incomplete. The
approaches based on progression modeling require making assumptions about the disease
dynamics and are therefore only applicable to diseases where the dynamics are relatively
well-understood. Finally, the supervised approach that regresses against a surrogate outcome,
which is also the most commonly used, often produces scores that suffer from bias due to
treatment-related censoring (Paxton et al, 2013). To see why,
note that for model training, supervised examples are obtained by annotating each patient’s
record as a positive or negative training example depending on whether they experienced the
target outcome or not (Fine et al, 1997). However, a high-risk patient, if treated in a timely
manner, may not experience the adverse event. If there is a group of such patients, who are
consistently treated and therefore never experience the adverse event, the learning algorithm
will consider their symptoms preceding their treatment as low-risk states, and give them a low
severity score. This poses a problem when this severity score is moved to a different
environment where treatment decisions are made based on the score alone. A caregiver may choose
not to treat these high-risk states because of their low score, thereby worsening outcomes.
We elaborate on this issue further with the SyntheticFlu example in Section 3.1. Accounting
for the effects of treatments on the downstream outcome is one way to circumvent this issue
(e.g., Henry et al, 2015); in this paper we propose an alternative framework.

Our contribution. We propose the Disease Severity Score Learning (DSSL) framework, which
exploits the key observation that, while requesting experts to quantify disease severity at
a given time is challenging, acquiring clinical comparisons (clinical assessments that
order the disease severity at two different times) is often easy. These clinical comparisons,
compared to labels based on downstream adverse events, are also less sensitive to treatment
patterns. Further, in the majority of diseases, clinical guidelines provide rules for
coarse-grained assessment of stages of a disease (see examples in AHRQ, 2015). These stages can
be used to augment expert-provided clinical comparisons with those that are automatically
generated using these guidelines. We show how we leverage an existing guideline (Dellinger
et al, 2013) in our example application.

DSSL uses clinical comparisons within the same patient and across patients to train a
temporally smooth disease severity score. From these clinical comparisons, DSSL learns a
function that maps the patient’s observed feature vectors to a scalar severity score. With some
abuse of terminology, we refer to this mapping function as the disease severity score (DSS)
function. We present two different algorithms for learning the DSS: the first in the linear
setting, and the second in the non-linear setting. In both cases, the parameters of the DSS
function are found by optimizing an objective function that contains two key terms. The
first term penalizes pairs that are incorrectly ordered by their severity. The second term
imposes a penalty on changes of the severity score that are driven by the temporal evolution
of the disease. For example, in our application, sepsis evolves slowly and the learning
objective leverages this by favoring smooth score trajectories over non-smooth ones. We show
how two commonly used ranking algorithms can be extended to our problem in a relatively
straightforward manner. For the linear DSS, we extend the soft max-margin formulation by
Joachims (2002) to maximize separation between ordered pairs while preserving temporal
smoothness. For the non-linear DSS, the score is represented non-parametrically using a
weighted sum of regression trees. We use an optimization procedure similar to that of
gradient boosted regression (Mason et al, 1999; Friedman, 2001) to obtain the DSS function
parameters. We show numerical results on the task of training a sepsis severity score for
patients in the ICU.

Below, we highlight the main strengths of the proposed DSS learning framework:

1. Our learning algorithm provides a scalable and automatic approach to learning disease 
severity scores in new disease domains and populations. 

2. Our learning algorithm only requires a means for obtaining clinical comparisons, i.e.,
ordered pairs comparing disease severity state at different times. This form of supervision
is more natural to elicit than asking clinical experts to map the disease severity score,
or encoding an accurate model of disease progression. Moreover, this supervision can
often be generated automatically. Our approach allows experts to tune the quality of the
score by increasing the granularity and amount of supervision given.

3. We show that our algorithm learns scores that are consistent with clinical expectations. 
For example, changes in the severity score over consecutive time periods are smooth 
and the score is higher in periods adjacent to an adverse event. Additionally, the score is 
sensitive to changes in disease severity state due to therapies. 


2 The Disease Severity Score Learning (DSSL) Framework

In this section, we introduce our methodology for learning DSS functions. We begin by 
outlining the general framework for learning a temporally smooth disease severity score 
in Section 2.1. Section 2.2 presents the soft-margin approach for learning a linear DSS 
function. In Section 2.3, we extend our methodology to non-linear DSS functions using 
gradient boosted regression trees. 


2.1 Overview 

We consider data that are routinely collected in a hospital setting. These include covariates
such as age, gender, and clinical history (e.g., presence or absence of a clinical condition
such as AIDS or diabetes) obtained at the time of admission; time-varying measurements
such as heart rate, respiratory rate, and urine volume obtained throughout the length of stay;
and text notes summarizing the patient’s evolving health status. These data are processed
and transformed into tuples $\langle x_i^p, t_i^p \rangle$ where $x_i^p \in \mathbb{R}^d$ is a
d-dimensional feature vector associated with patient $p \in P$ at time $t_i^p$ for
$i \in \{1, \ldots, T^p\}$, and $T^p$ is the total number
of tuples for patient p. A feature vector contains raw measurements (e.g., last measured
heart rate or last measured white blood cell count) and features derived from one or more
measurements (e.g., the mean and variance of the measured heart rate over the last six hours
or the total urine output in the last six hours per kilogram of weight). In Figure 1 in Section 1,
we showed example components of a feature vector computed for a patient in the intensive
care unit over a 48-hour period. Let D denote the set of tuples across all patients in the study.
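As an illustration of how raw and derived features might be assembled into a feature vector, the sketch below computes the four example features named above. The function name, argument names, and the choice of a half-open six-hour window are assumptions; the text does not specify how windows are aligned or how missing data are handled. Times are assumed to be sorted and in hours, with at least one measurement inside the window.

```python
import numpy as np


def feature_vector(times, heart_rate, urine_ml, weight_kg, t_now, window_h=6.0):
    """Build [last HR, mean HR, HR variance, urine per kg] at time t_now."""
    times = np.asarray(times, dtype=float)
    hr = np.asarray(heart_rate, dtype=float)
    urine = np.asarray(urine_ml, dtype=float)

    seen = times <= t_now                        # measurements available so far
    recent = seen & (times > t_now - window_h)   # measurements in the last window

    return np.array([
        hr[seen][-1],                     # raw: last measured heart rate
        hr[recent].mean(),                # derived: mean heart rate in the window
        hr[recent].var(),                 # derived: heart rate variance in the window
        urine[recent].sum() / weight_kg,  # derived: urine output per kg in the window
    ])
```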

The problem of learning a DSS function is defined by the sets O and S of pairs of
tuples from the set D of all tuples, and by the set G of permissible DSS functions. The
set O contains pairs of tuples $(\langle x_i^p, t_i^p \rangle, \langle x_j^q, t_j^q \rangle)$
that are ordered by severity based
on clinical assessments. We refer to each of these paired tuples as a clinical comparison
and the set O as the set of all available clinical comparisons. For notational simplicity, we
assume that $x_i^p$ corresponds to a more severe state than $x_j^q$. These clinical comparisons
can be obtained by presenting clinicians with data $x_i^p$ for patient $p \in P$ at time $t_i^p$
and data $x_j^q$ for patient $q \in P$ at time $t_j^q$. For each such pair of feature vectors,
the clinical expert identifies which of these corresponds to a more severe health state; the
expert can choose not to provide a comparison for a pair where the severity ordering is
ambiguous. These pairs
can also be generated in an automated fashion by leveraging existing clinical guidelines. In
Section 3.3.1, we describe how we use an existing guideline in our application.

The set S contains pairs of tuples
$(\langle x_i^p, t_i^p \rangle, \langle x_{i+1}^p, t_{i+1}^p \rangle)$ that correspond to
feature vectors that are taken from the same patient p at consecutive time steps $t_i^p$ and
$t_{i+1}^p$. These pairs are used to impose smoothness constraints on the learned severity
scores. We thus
refer to the pairs in S as the smoothness pairs. Finally, the set G contains a parameterized
family of candidate DSS functions g that map feature vectors x to a scalar severity score.
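The construction of the smoothness pairs in S can be sketched as follows. The data layout (a dictionary from a patient identifier to a list of (x, t) tuples) and the function name are assumptions made for illustration.

```python
def smoothness_pairs(tuples_by_patient):
    """Pair each tuple with the next tuple in time from the same patient.

    tuples_by_patient: dict mapping a patient id to a list of (x, t) tuples.
    Returns a list of ((x_i, t_i), (x_next, t_next)) pairs.
    """
    pairs = []
    for tuples in tuples_by_patient.values():
        ordered = sorted(tuples, key=lambda xt: xt[1])  # sort by time t
        pairs.extend(zip(ordered[:-1], ordered[1:]))    # consecutive tuples only
    return pairs
```

Patients with a single tuple contribute no pairs, and pairs never cross patients.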

Our goal is to identify a function $g \in G$ that quantifies the severity of the disease state
represented by a feature vector x. In particular, this function should correctly order any
pair (x, x') of feature vectors by their severity, and the resulting score should be temporally
smooth to mimic the natural inertia exhibited by biological systems. We use empirical
risk minimization to identify such a function g. Namely, we construct an objective function
$C^g$ that maps functions $g \in G$ to their empirical risk. The first of the two terms in $C^g$ is

$$ \sum_{(\langle x_i^p, t_i^p \rangle, \langle x_{i+1}^p, t_{i+1}^p \rangle) \in S}
\left( \frac{g(x_{i+1}^p) - g(x_i^p)}{t_{i+1}^p - t_i^p} \right)^2 \qquad (1) $$

This term penalizes DSS functions that exhibit large changes in the severity score over short
durations, hence encouraging selection of temporally smooth DSS functions. The second
term in $C^g$ penalizes g for pairs of tuples
$(\langle x_i^p, t_i^p \rangle, \langle x_j^q, t_j^q \rangle) \in O$ for which the
severity ordering induced by g on vectors $x_i^p$ and $x_j^q$ is inconsistent with the ground
truth clinical assessment, i.e., $g(x_i^p) < g(x_j^q)$. We discuss the full objective
comprising these two terms in greater detail in Section 2.2.

In the following two sections we describe the objectives and corresponding optimization 
algorithms for learning the linear and non-linear DSS functions in a new disease domain. 


2.2 Learning a Linear DSS 

We first consider the problem of learning linear DSS functions, i.e., DSS functions of the
form $g_w(x) = w^\top x$. We refer to the corresponding learning procedure as L-DSS.

We employ soft max-margin training (Joachims, 2002) where we seek to maximize the
distance between the pairs that are at different severity levels while keeping the distance
between the consecutive pairs smooth. We briefly review the key concepts of soft max-margin
ranking before we describe our extension for learning a linear DSS function given data.

Soft Max-Margin Ranking: Consider the toy example shown in Figure 2. Let D contain
the three feature vectors $\{x_1, x_2, x_3\}$ where $x_i \in \mathbb{R}^2$, and let O contain
the pairs $(x_2, x_1)$ and $(x_3, x_2)$, i.e., feature vectors $x_2$ and $x_3$ have higher
disease severity than $x_1$ and $x_2$ respectively. Max-margin ranking seeks to find a vector
w such that the margin between
pairs of different severity levels is maximized. In our example, we show parameter vectors
$w_1$, $w_2$ and $w_3$ for three candidate ranking functions in Figure 2. For each feature vector
x, the assigned (severity) score for a given ranking function parameter $w_i$ is computed as
the projection, $g_{w_i}(x)$, of x on $w_i$. The induced ranking between two vectors $x_1$ and
$x_2$ is computed based on the margin, which is defined as the difference in their projections.
In the example shown, the rankings induced by both $g_{w_1}$ and $g_{w_3}$ correctly order all
pairs in O, i.e.,

$$ g_{w_1}(x_3) > g_{w_1}(x_2) > g_{w_1}(x_1) \quad \text{and} \quad
g_{w_3}(x_3) > g_{w_3}(x_2) > g_{w_3}(x_1), $$

while the rankings induced by $w_2$ do not. Furthermore, $w_3$ also induces an ordering with
a larger margin between the pairs in O. Margin-maximization leads to an ordering that is
more robust with respect to noise in x.

Fig. 2 We show the projections of $x_1$, $x_2$ and $x_3$ on vectors $w_1$ and $w_2$ representing
two candidate ranking functions; vectors drawn in red, green and blue identify projections of
$x_1$, $x_2$ and $x_3$ respectively. Ranking is induced by the differences in the projections;
for example, $w_2$ induces the ordering $g_{w_2}(x_1) > g_{w_2}(x_2)$
because $g_{w_2}(x_1) - g_{w_2}(x_2) > 0$.
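The toy example can be reproduced numerically with a short sketch. The coordinates of the feature vectors and of the three candidate directions below are hypothetical values chosen to mimic the described behavior; they are not taken from Figure 2.

```python
import numpy as np

# Hypothetical 2-D feature vectors, ordered by increasing severity.
X = {"x1": np.array([1.0, 1.0]),
     "x2": np.array([2.0, 1.5]),
     "x3": np.array([3.0, 3.0])}
O = [("x2", "x1"), ("x3", "x2")]  # (more severe, less severe)

# Hypothetical candidate ranking directions.
W = {"w1": np.array([1.0, 0.0]),
     "w2": np.array([0.0, -1.0]),
     "w3": np.array([1.0, 1.0])}


def margins(w):
    """Difference in projections on w for each ordered pair in O."""
    unit = w / np.linalg.norm(w)  # project on the unit vector
    return np.array([X[hi] @ unit - X[lo] @ unit for hi, lo in O])


def orders_correctly(w):
    """True when every pair in O has a positive margin."""
    return bool(np.all(margins(w) > 0))
```

With these values, `w1` and `w3` order both pairs correctly while `w2` does not, and `w3` attains the larger minimal margin, which is the property that max-margin ranking selects for.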

More formally, for each pair of feature vectors $(x_i, x_j) \in O$, we define the margin of
their separation by the function $g_w(\cdot)$ as $\mu_w^{i,j} = g_w(x_i) - g_w(x_j)$. The
maximum-margin approach suggests that we can improve generalization and robustness of the
learned separator by selecting w that maximizes the number of tuples that are ordered correctly
(i.e., $\mu_w^{i,j} > 0$) while simultaneously maximizing the minimal normalized margin
$\mu_w^{i,j} / \|w\|$. Using the standard soft max-margin framework, the SVMRank algorithm
(Joachims, 2002)
approximates the above-mentioned problem as the following convex optimization program:

$$ \min_{w,\, \xi_O^{i,j}} \; \frac{1}{2} \|w\|^2
+ \frac{\lambda_O}{|O|} \sum_{(x_i, x_j) \in O} \xi_O^{i,j} $$

subject to the following ordering constraints:

$$ \forall (x_i, x_j) \in O : \; g_w(x_i) - g_w(x_j) \ge 1 - \xi_O^{i,j}
\;\text{ and }\; \xi_O^{i,j} \ge 0 \qquad (2) $$


The L-DSS Objective and Optimization Algorithm: We now describe our algorithm for
learning linear DSS functions. We return to our original setting where we are given sets O
and S which contain feature vectors belonging to more than one patient at varying times.
We augment the soft max-margin objective with the additional term, shown in Eq. (1),
that encourages temporal smoothness. We state the full L-DSS objective below.

$$ \min_{w,\, \xi_O \ge 0} \; \frac{1}{2} \|w\|^2
+ \frac{\lambda_O}{|O|} \sum_{(\langle x_i^p, t_i^p \rangle, \langle x_j^q, t_j^q \rangle) \in O}
\xi_O^{(p,i),(q,j)}
+ \frac{\lambda_S}{|S|} \sum_{(\langle x_i^p, t_i^p \rangle, \langle x_{i+1}^p, t_{i+1}^p \rangle) \in S}
\left( \frac{g_w(x_{i+1}^p) - g_w(x_i^p)}{t_{i+1}^p - t_i^p} \right)^2 \qquad (3) $$

subject to the following ordering constraints:

$$ \forall (\langle x_i^p, t_i^p \rangle, \langle x_j^q, t_j^q \rangle) \in O : \;
g_w(x_i^p) - g_w(x_j^q) \ge 1 - \xi_O^{(p,i),(q,j)} \;\text{ and }\; \xi_O^{(p,i),(q,j)} \ge 0 $$

Here, the coefficients $\lambda_O$ and $\lambda_S$ control the relative degree of emphasis on the
smoothness versus the margin-maximization component of the objective. For a given setting of
$\lambda_O$, different choices of $\lambda_S$ yield trajectories with differing levels of
smoothness. An appropriate choice of $\lambda_S$ could be determined by the clinical user based
on the rate of change in severity
that is to be expected in that domain. For example, in sepsis, changes in severity do not
occur within minutes while in many cardiac conditions, rapid changes in severity can occur.
Alternatively, this parameter can be set using cross-validation to optimize performance for a
particular application of DSS.

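In Eq. (3), for every value of w, the optimal values of $\xi_O^{(p,i),(q,j)}$ are given by

$$ \xi_O^{(p,i),(q,j)} = \max\{0,\, 1 - (g_w(x_i^p) - g_w(x_j^q))\}. \qquad (4) $$

Substituting Eq. (4) and $g_w(x) = w^\top x$ in Eq. (3), we obtain the following unconstrained
convex optimization formulation:

$$ \min_{w} \; \frac{1}{2} \|w\|^2
+ \frac{\lambda_O}{|O|} \sum_{(\langle x_i^p, t_i^p \rangle, \langle x_j^q, t_j^q \rangle) \in O}
\max\{0,\, 1 - w^\top (x_i^p - x_j^q)\}
+ \frac{\lambda_S}{|S|} \sum_{(\langle x_i^p, t_i^p \rangle, \langle x_{i+1}^p, t_{i+1}^p \rangle) \in S}
\left( \frac{w^\top (x_{i+1}^p - x_i^p)}{t_{i+1}^p - t_i^p} \right)^2 \qquad (5) $$

Instead of solving the dual formulation as in Joachims (2002), for the reasons of
efficiency and accuracy discussed by Chapelle and Keerthi (2010), we solve the primal form
of this optimization program as follows.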

The terms of the form max{0, a}, also called the hinge loss, are not differentiable at
a = 0. We approximate these terms with the Huber loss $L_h$ for 0 < h < 1 given by

$$ L_h(a) = \begin{cases}
0, & \text{if } a < -h \\
\dfrac{(a + h)^2}{4h}, & \text{if } |a| \le h \\
a, & \text{if } a > h
\end{cases} $$
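A direct, vectorized transcription of this piecewise definition is sketched below; the function name and the default value of h are assumptions. The three branches meet continuously at a = -h (value 0) and at a = h (value h).

```python
import numpy as np


def huber_hinge(a, h=0.5):
    """Smooth approximation of max(0, a), with 0 < h < 1."""
    a = np.asarray(a, dtype=float)
    quadratic = (a + h) ** 2 / (4.0 * h)  # branch used when |a| <= h
    return np.where(a < -h, 0.0, np.where(a > h, a, quadratic))
```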
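This approximation yields the following unconstrained, convex, twice-differentiable
optimization problem:

L-DSS Objective:

$$ \min_{w} \; \frac{1}{2} \|w\|^2
+ \frac{\lambda_O}{|O|} \sum_{(\langle x_i^p, t_i^p \rangle, \langle x_j^q, t_j^q \rangle) \in O}
L_h\!\left(1 - w^\top (x_i^p - x_j^q)\right)
+ \frac{\lambda_S}{|S|} \sum_{(\langle x_i^p, t_i^p \rangle, \langle x_{i+1}^p, t_{i+1}^p \rangle) \in S}
\left( \frac{w^\top (x_{i+1}^p - x_i^p)}{t_{i+1}^p - t_i^p} \right)^2 \qquad (6) $$

We solve this optimization program using the Newton-Raphson algorithm. We show
experiments using the L-DSS learner in Section 3.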
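A minimal sketch of a Newton-Raphson solver for Eq. (6) is given below. It is an illustration under stated assumptions rather than the authors' implementation: the ordered pairs are passed as a matrix D whose rows are the differences $x_i^p - x_j^q$, the smoothness pairs as a matrix R whose rows are $(x_{i+1}^p - x_i^p)/(t_{i+1}^p - t_i^p)$, and a backtracking line search is added for robustness even though the text does not mention one. All function and argument names are hypothetical.

```python
import numpy as np


def huber(a, h):
    return np.where(a < -h, 0.0, np.where(a > h, a, (a + h) ** 2 / (4 * h)))


def objective(w, D, R, lam_o, lam_s, h):
    """Eq. (6) with rows of D and R as described in the lead-in."""
    return (0.5 * w @ w
            + lam_o * huber(1 - D @ w, h).mean()
            + lam_s * ((R @ w) ** 2).mean())


def fit_l_dss(D, R, lam_o=1.0, lam_s=1.0, h=0.5, max_iter=100, tol=1e-8):
    w = np.zeros(D.shape[1])
    for _ in range(max_iter):
        a = 1 - D @ w
        # First and second derivatives of the Huber loss at each pair.
        d1 = np.where(a < -h, 0.0, np.where(a > h, 1.0, (a + h) / (2 * h)))
        d2 = np.where(np.abs(a) <= h, 1 / (2 * h), 0.0)
        grad = w - lam_o * (D.T @ d1) / len(D) + 2 * lam_s * (R.T @ (R @ w)) / len(R)
        if np.linalg.norm(grad) < tol:
            break
        hess = (np.eye(len(w))
                + lam_o * (D.T * d2) @ D / len(D)
                + 2 * lam_s * R.T @ R / len(R))
        step = np.linalg.solve(hess, grad)  # Newton direction is -step
        # Backtracking line search (an assumption, not part of the text).
        t, f0 = 1.0, objective(w, D, R, lam_o, lam_s, h)
        while (objective(w - t * step, D, R, lam_o, lam_s, h)
               > f0 - 1e-4 * t * (grad @ step)) and t > 1e-10:
            t *= 0.5
        w = w - t * step
    return w
```

Because the objective contains the term $\|w\|^2/2$, the Hessian is always positive definite, so each Newton direction is a descent direction and the line search can only decrease the objective.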


2.3 Learning a Non-linear DSS 

In many disease domains, assuming a linear mapping between the measurements and the 
latent disease severity may be too restrictive. For example, ranges for measurement values
that are considered to be normal (or from a low-severity state) are often age dependent or 
clinical history dependent. Consider an individual with a pre-existing kidney condition; he or 
she is likely to have a worse baseline creatinine level (a test that measures kidney function) 
compared to an individual with fully-functioning kidneys. Thus, when measuring changes 
in severity related to the kidney, these individuals are likely to manifest a disease differently. 
See the guideline by Dellinger et al (2013) for other examples. 

To learn non-linear DSS functions, we represent g as a weighted sum of regression trees.
Alternate choices for learning non-linear DSS functions exist, including extending the
soft-margin formulation presented for learning L-DSS via use of the “kernel-trick” (Kuo et al,
2014). We chose to extend boosted regression trees as this is one of the most widely used
algorithms for ranking (e.g., see Mohan et al, 2011).

Our hypothesis class G includes all linear combinations of shallow regression trees,
i.e., functions of the form $g(x) = \sum_{k=1}^{K} \alpha_k f_k(x)$, where $f_k$ for
$k = 1, \ldots, K$ are shallow
(limited-depth) regression trees and K is finite. In our experiments, K is set to 5. Similar to
the objective for L-DSS in Eq. (6), we construct the NL-DSS objective to identify $g \in G$
that maximizes the dual criteria of ordering accuracy and temporal smoothness as:


NL-DSS Objective:

$$ \min_{g \in G} \; \frac{1}{|O|}
\sum_{(\langle x_i^p, t_i^p \rangle, \langle x_j^q, t_j^q \rangle) \in O}
L_h\!\left(1 - (g(x_i^p) - g(x_j^q))\right)
+ \frac{\lambda_S}{|S|} \sum_{(\langle x_i^p, t_i^p \rangle, \langle x_{i+1}^p, t_{i+1}^p \rangle) \in S}
\left( \frac{g(x_{i+1}^p) - g(x_i^p)}{t_{i+1}^p - t_i^p} \right)^2 \qquad (7) $$

Note that since the soft max-margin formulation is not defined for a non-linear classifier, we
drop the term $\|w\|^2 / 2$. Thus, without loss of generality, $\lambda_O$ can be replaced by 1.
Now, the relative emphasis on the smoothing versus the ordering components is changed by varying
$\lambda_S$.

We optimize the NL-DSS objective using the gradient boosted regression trees (GBRT)
learning algorithm (Friedman, 2001; Mason et al, 1999; Burges, 2010). Gradient boosting
methods grow g incrementally, in a greedy fashion, by adding a weak learner (in this case,
a regression tree) at each iteration. A tree that most closely approximates the gradient of
$C^g$ evaluated at g obtained in the previous iteration is added (Friedman, 2001).

The per-iteration computational complexity of this approach is equivalent to the
computational complexity of building a single regression tree, which is $|T| \log |T|$
(Hothorn et al, 2006), where |T| is the number of unique tuples in the set $O \cup S$ of tuple pairs.
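The quantity that each new tree is fitted to can be sketched as follows. The code computes the negative gradient of an objective with an ordering term (Huber-smoothed hinge, weight 1) and a smoothness term (weight `lam_s`) with respect to the current score of every tuple; this form of the objective, the index-based pair encoding, and all names are assumptions made for illustration. Fitting the regression tree to these targets and choosing its weight are omitted.

```python
import numpy as np


def huber_grad(a, h):
    """Derivative of the Huber-smoothed hinge loss."""
    return np.where(a < -h, 0.0, np.where(a > h, 1.0, (a + h) / (2 * h)))


def negative_gradient(scores, order_pairs, smooth_pairs, lam_s=1.0, h=0.5):
    """Regression targets for the next tree.

    scores: current score g(x) of each unique tuple.
    order_pairs: list of (hi, lo) indices, hi being the more severe tuple.
    smooth_pairs: list of (a, b, dt) with b following a after dt time units.
    """
    scores = np.asarray(scores, dtype=float)
    grad = np.zeros_like(scores)
    if order_pairs:
        hi = np.array([p[0] for p in order_pairs])
        lo = np.array([p[1] for p in order_pairs])
        slope = huber_grad(1 - (scores[hi] - scores[lo]), h) / len(order_pairs)
        np.add.at(grad, hi, -slope)  # raising the severe score lowers the loss
        np.add.at(grad, lo, slope)
    if smooth_pairs:
        a = np.array([p[0] for p in smooth_pairs])
        b = np.array([p[1] for p in smooth_pairs])
        dt = np.array([p[2] for p in smooth_pairs], dtype=float)
        rate = (scores[b] - scores[a]) / dt
        slope = 2 * lam_s * rate / (dt * len(smooth_pairs))
        np.add.at(grad, b, slope)
        np.add.at(grad, a, -slope)
    return -grad
```

`np.add.at` is used so that a tuple appearing in several pairs accumulates all of its contributions.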


3 Experiments 

We now describe the evaluation of the proposed DSSL framework. Before discussing
numerical results on a real-world dataset, in Section 3.1, we use a simple toy example to
illustrate the behavior of DSS related to the following questions. First, when DSS is transported
between environments with different degrees of interventional confounds, how is the
performance of DSS impacted compared to the performance of a supervised learning algorithm
that uses downstream events as labels? Second, when clinical comparisons are generated by
implementing automated coarse-grading rules, does the learned score simply learn the rule
itself? In Section 3.2, we provide background on our application: we introduce sepsis, the
dataset used, and the guideline used for generating the clinical comparisons needed to train
the DSS functions. Next, in Section 3.3, we provide an overview of the experiments and the
experimental setup followed by a detailed discussion of the numerical results on the sepsis
data in Section 3.3.3.


3.1 Learning DSS for SyntheticFlu 

For these experiments, we create a simple toy disease called SyntheticFlu as follows. We
quantify severity as a function of the patient’s temperature and white blood cell count
(WBC): as the temperature or the WBC increases, the risk of mortality increases. We assume
that the disease manifests in two ways: with 50% probability, patients are sampled from a
model where the temperature tends to deteriorate while the WBC remains normal, and for
the rest of the population, the WBC tends to deteriorate while the temperature
remains normal. Each of these measurements assumes one of 10 states (e.g., the temperature
ranges from 99 to 108 °F). In the absence of treatment, a measurement that is
deteriorating retains its value T in the following timestep with probability 0.3, increases to
T + 1 with probability 0.5, and decreases to T - 1 with probability 0.2. For a measurement
that is assumed to stay normal, the corresponding transition probabilities are 0.7, 0.1 and
0.2. States 1-2 are defined to be “benign” (e.g., temperatures of 99°F and 100°F), where
an individual can be discharged with probability 0.5 (i.e., their sampled trajectory ends).
States 6-9 are defined to be “severe” (e.g., temperatures from 104°F to 107°F), where
an individual may receive treatment with a probability that depends on the treatment regime.
Administration of treatment (e.g.,
antibiotics) transitions the individual to one of the benign states. Finally, an individual dies
when she or he reaches state 10.
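The generative process described above can be sketched as a small simulator. Several details are not specified in the text and are assumptions here: the starting state, clipping of states to the range 1 to 10, discharge being possible only when both measurements are benign, treatment resetting only the treated measurement to a uniformly chosen benign state, the order of events within a timestep, and a cap on trajectory length. All names are hypothetical.

```python
import random

BENIGN = (1, 2)
SEVERE = (6, 7, 8, 9)
DEATH_STATE = 10

# (stay, up, down) transition probabilities from the text.
DETERIORATING = (0.3, 0.5, 0.2)
NORMAL = (0.7, 0.1, 0.2)


def step(value, probs, rng):
    """Move one measurement by -1, 0, or +1 (clipping to 1..10 is an assumption)."""
    stay, up, _down = probs
    u = rng.random()
    delta = 0 if u < stay else (1 if u < stay + up else -1)
    return min(DEATH_STATE, max(1, value + delta))


def simulate_patient(rng, p_treat_temp=0.0, p_treat_wbc=0.0, start=3, max_steps=200):
    """Return (trajectory of (temp, wbc) states, died flag)."""
    temp_deteriorates = rng.random() < 0.5  # which measurement deteriorates
    temp, wbc = start, start                # assumed starting state
    trajectory = [(temp, wbc)]
    for _ in range(max_steps):
        temp = step(temp, DETERIORATING if temp_deteriorates else NORMAL, rng)
        wbc = step(wbc, NORMAL if temp_deteriorates else DETERIORATING, rng)
        # Treatment moves a severe measurement to one of the benign states.
        if temp in SEVERE and rng.random() < p_treat_temp:
            temp = rng.choice(BENIGN)
        if wbc in SEVERE and rng.random() < p_treat_wbc:
            wbc = rng.choice(BENIGN)
        trajectory.append((temp, wbc))
        if DEATH_STATE in (temp, wbc):
            return trajectory, True
        if temp in BENIGN and wbc in BENIGN and rng.random() < 0.5:
            return trajectory, False  # discharged
    return trajectory, False
```

Calling `simulate_patient(random.Random(0), p_treat_temp=0.3)` draws one patient from a regime in which only high temperature is treated.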

Evaluating transportability: The first question we investigate concerns the transportability
of the different scores. Namely, when DSS is moved between different treatment
regimes, how is the performance of DSS impacted compared to the performance of a
supervised learning algorithm that uses downstream events as labels? Since treatments affect the
prevalence of adverse outcomes, we show that the risk scores learned via the latter approach
are highly sensitive to treatment patterns and therefore, in our example setting, generalize
poorly compared to DSS. We sample 1000 patients each for the train and test sets. Data are
sampled from different treatment regimes as shown in Table 1. For example, in scenario
1, no treatments are prescribed in either the train or test regimes. In scenario 5, in the train
regime, treatments are prescribed with 0.3 probability only for treating high temperature
but not for high WBC. However, in the test regime, treatments are prescribed only for high
WBC but not for high temperature. Training L-DSS and NL-DSS requires generating
clinical comparisons. To do so, we randomly sample pairs from the observed data. For a pair
$(x_i, x_j)$, we consider $x_i$ to represent a more severe state if one of the measurements is at
least 2 units higher in $x_i$ than in $x_j$ (e.g., temperature of 103°F is more severe than 101°F)
and the other measurement is at least as high in $x_i$ as in $x_j$. Many such pairs are sampled
for the train and test set. We use a standard protocol for training logistic regression (LR)
(Paxton et al, 2013). Train and test samples are generated from the patient trajectories via a
sliding window approach with the outcome defined as whether or not the patient died within
10 timesteps in the future. On the test set, we consider a patient to be correctly identified as
an at-risk patient if their severity score was greater than a certain threshold value at any point
of the patient’s hospital stay. We measure performance of the obtained scores using the area under
the curve (AUC) obtained on the task of predicting per-patient mortality (Paxton et al, 2013).
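The comparison rule and the per-patient evaluation can be sketched as follows. Flagging a patient whose score exceeds a threshold at any point of the stay is equivalent to thresholding the maximum score along the trajectory, so the sketch scores each patient by that maximum. The pairwise (rank-based) AUC formula with ties counted as one half is a standard choice and an assumption here, as are the function names.

```python
def more_severe(xi, xj):
    """True if xi is ordered above xj by the rule in the text.

    xi, xj: (temperature, WBC) states. One measurement must be at least
    2 units higher in xi and the other at least as high.
    """
    return ((xi[0] - xj[0] >= 2 and xi[1] >= xj[1])
            or (xi[1] - xj[1] >= 2 and xi[0] >= xj[0]))


def patient_score(trajectory_scores):
    """Per-patient score: the maximum severity score over the stay."""
    return max(trajectory_scores)


def auc(died_scores, survived_scores):
    """Fraction of (died, survived) pairs ranked correctly; ties count 0.5."""
    total = 0.0
    for d in died_scores:
        for s in survived_scores:
            total += 1.0 if d > s else (0.5 if d == s else 0.0)
    return total / (len(died_scores) * len(survived_scores))
```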

Results: In scenario 1 and scenario 2 where the treatment patterns are the same across the
train and test regimes, all three scores perform equally well. However, as the treatment
patterns begin to diverge to an increasing degree as seen in scenarios 4 and 5, LR’s performance
drops while the DSS performance does not change. It is worth noting that such
discrepancies between the train and test regimes can occur within the same hospital when comparing
treatment practice before and after deploying a predictive model. Specifically, while
clinicians continue to treat high-risk states, the resulting LR score learned from these data will
underestimate risks for the treated states. Once the decision support system is deployed and
clinicians begin to rely on the predictive tool, they may erroneously undertreat high-risk
patients, thereby worsening outcomes.


Scenario   p_T^train   p_WBC^train   p_T^test   p_WBC^test   Logistic Regression   L-DSS   NL-DSS
#1         0           0             0          0            0.974                 0.973   0.974
#2         0.1         0             0.1        0            0.978                 0.990   0.991
#3         0.1         0             0          0            0.963                 0.974   0.981
#4         0.3         0             0          0            0.769                 0.973   0.981
#5         0.3         0             0          0.3          0.510                 0.978   0.996

Table 1 Analysis of transportability of the L-DSS, NL-DSS and logistic regression based severity scores 
between different treatment regimes. p_T^train, p_WBC^train, p_T^test and p_WBC^test denote the probability 
of treatment for temperature and WBC in the train and test regimes, respectively. Performance entries are 
AUC values. 


Relationship of Learned Scores to the Coarse Grades: As mentioned, for many diseases, 
clinical guidelines provide rules for coarse-grained assessment of severity stages of a 
disease. These guidelines can be used for automated generation of clinical comparison pairs 
with unambiguous severity ordering. This raises a natural question of whether a DSS learned 
from such clinical comparisons simply recovers these clinical guidelines and thus yields no 
generalization beyond the coarse grading. To examine this question, we extend the setup 
described above. Specifically, we augment the feature vectors to include the coarse grades, 
which are in turn derived from the temperature and 
WBC measurements. We derive the coarse severity grade from the observed feature vectors 
as follows. We assign a feature vector a severity of 0 if both the corresponding WBC and 
temperature measurements are in states 1 or 2 (e.g., temperature below 101°F), a severity of 
1 if exactly one of the measurements is in state 3 or higher (e.g., temperature above or equal 
to 101 °F), and a severity of 2 if both measurements are in states 3 or higher. We consider all 
three combinations of coarse-graded severity pairs and randomly sample 6000 clinical 
comparisons and a similar number of smoothness pairs. We use $\lambda_O = 100$ for the L-DSS and 
sweep values of $\lambda_S$ between 0.1 and 1000. We show results using data sampled from regime 1 
though the conclusions do not depend on the treatment pattern. 
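
The coarse grading rule above reduces to counting how many of the two measurements are in state 3 or higher. The sketch below assumes integer state indices and a hypothetical `augment` helper that appends the grade to the feature vector.

```python
def coarse_grade(temp_state, wbc_state, high_state=3):
    """Coarse severity grade: the number of measurements in state >= high_state.

    0 if both measurements are in states 1 or 2, 1 if exactly one is in
    state 3 or higher, and 2 if both are.
    """
    return int(temp_state >= high_state) + int(wbc_state >= high_state)

def augment(feature_vector, temp_state, wbc_state):
    """Append the coarse grade to a feature vector (hypothetical helper)."""
    return tuple(feature_vector) + (coarse_grade(temp_state, wbc_state),)
```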

Since perfect ordering accuracy can be achieved using only the coarse grading compo¬ 
nent of the feature vector, one might expect that the learned scores will rely on the coarse 
grading feature alone. In fact, this is the case when $\lambda_S$ is small ($= 0.1$). However, such a DSS 
score will exhibit abrupt changes between consecutive time points, e.g., when the temperature 
or WBC progresses from state 2 to state 3 or vice versa. As $\lambda_S$ increases, the smoothness 
term in the DSS objective encourages the learning of temporally smooth scores which rely 
increasingly on the WBC and temperature features alone. Thus, in this scenario, the smoothness 
constraint allows the L-DSS and NL-DSS scores to generalize beyond coarse grades. 
In Figure 3, we depict the severity score assigned to a feature vector in which WBC is in 
state 1 and temperature varies from 99°F to 107°F for varying values of $\lambda_S$. 



Fig. 3 Experiments on synthetic data: relationship of learned scores to the coarse grades. L-DSS and 
NL-DSS scores for varying values of $\lambda_S$ (legend: 0.1, 1, 10, 100, 1000) for a patient with WBC in 
state 2 and temperature ranging from 99°F to 107°F. The scores are normalized so that a temperature of 
99°F is given a score of 0 and a temperature of 107°F is given a score of 1. 


3.2 Sepsis, MIMIC-II and the Surviving Sepsis Campaign Guideline 

In the following experiments on the real-world clinical data, our goal is to learn a score that 
assesses the severity of sepsis. 

Sepsis: Sepsis is a whole-body inflammatory response to infection; it is a leading cause 
of death in the inpatient setting, with especially high mortality among patients who develop 
septic shock, a major sepsis-related adverse event. Both sepsis and septic shock are known 
to be associated with high morbidity, longer hospital stay and increased health care cost (Ku¬ 
mar et al, 2011). Often, the risk of sepsis-related adverse outcomes can be reduced by early 
treatment (Sebat et al, 2007); thus, a scoring system that allows precise tracking of changes 
in sepsis-related disease severity is of great importance. 

Dataset: We use MIMIC-II, a publicly available dataset containing electronic health record 
data from patients admitted to the ICUs at the Beth Israel Deaconess Medical Center from 
2001 to 2008 (Saeed et al, 2002, 2011). We only include adults (> 15 years old) in our 
study (N = 16,234). We compute 45 different features derived from vital sign measurements, 
clinical history variables, and laboratory test results. We provide the complete 
list of features in Appendix A. We impute missing data using linear interpolation. Other 
examples of imputation methods used for ICU monitoring datasets include model-based 
imputation (e.g., Ho et al, 2012) and Last Observation Carried Forward (LOCF) (Hug, 2009). 
These methods require making numerous domain-specific assumptions. For example, in LOCF, each 
measurement is carried forward for a finite time window whose length is determined based 
on the typical sampling frequency of this measurement. Since this choice is orthogonal to 
the focus of our paper, we implement linear interpolation as it is the simplest of the above 
methods. 
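
As an illustration of the imputation step, the sketch below linearly interpolates one measurement channel onto a set of query times. The function name and the choice to hold the nearest observed value outside the observed time range are assumptions; the text does not specify the boundary behavior.

```python
from bisect import bisect_left

def interpolate_linear(times, values, query_times):
    """Linearly interpolate `values` observed at strictly increasing `times`.

    Queries before the first or after the last observation return the
    nearest observed value (an assumption made for this sketch).
    """
    result = []
    for t in query_times:
        if t <= times[0]:
            result.append(values[0])
        elif t >= times[-1]:
            result.append(values[-1])
        else:
            k = bisect_left(times, t)  # times[k - 1] < t <= times[k]
            t0, t1 = times[k - 1], times[k]
            v0, v1 = values[k - 1], values[k]
            result.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return result
```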

Guideline for grading sepsis severity: In order to train a severity score in the DSSL frame¬ 
work, we need a set O of pairs of feature vectors ordered by their severity. We create these 
pairs automatically by leveraging the coarse severity grading of sepsis established in the 
Surviving Sepsis Campaign Guideline (SSCG) (Dellinger et al, 2013). The SSCG provides 
rules for identifying when an individual is in each of the four stages of severity: septic shock, 
severe sepsis, SIRS and none. For each of these stages, the guideline defines criteria using 
1) a combination of thresholds for individual measurements, and 2) presence of specific 
diagnosis codes or diagnoses noted in the clinical notes. For example, the SIRS criterion 
is met when at least two of four conditions hold; these conditions are defined over five 
measurements (heart rate, temperature, respiratory rate, arterial partial pressure of CO2, and 
white blood cell count). White blood cell count in thousands per microliter, for example, is 
considered to be out of the normal range if its value is either below 4.0 or above 12.0. Heart 
rate is considered to be out of the normal range if it is above 90 beats per minute. The stage 
of severe sepsis is reached when a physician suspects that the patient has developed an 
infection, the SIRS criterion is met, and at least 
one organ system is showing signs of failure. Finally, septic shock is defined as severe sepsis 
with observed hypotension despite significant fluid resuscitation. In Table 2, we specify the 
criteria used for grading each of the stages. It is also worth noting that the guideline does not 
provide a grade at all times: on average, application of these criteria grades less than 40% 
of data entries of a patient. We use the time points with available SSCG grading to generate 
clinical comparisons as described in Section 3.3.1. We also note that the SSCG grades use 
features beyond those used in the severity score. For example, to assess the severity grade, 
we use keyword search in transcripts written at the time of discharge to determine whether 
the patient had developed an infection. Similarly, features such as whether or not the patient 
received sufficient fluids are used for grading but excluded when learning the severity scores 
as these are caregiver driven and only indirectly measure severity. 
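
The SIRS rule, including the recency windows listed in Table 2, can be sketched as below. The dictionary keys and the `(time_in_hours, value)` representation are assumptions chosen for this example; they are not MIMIC-II field names.

```python
def is_recent(measurement, now, window_hours):
    """True if a (time_in_hours, value) measurement falls within the window."""
    return measurement is not None and 0 <= now - measurement[0] <= window_hours

def sirs_conditions(obs, now):
    """Count how many of the four SIRS conditions hold at time `now`.

    `obs` maps assumed names to the latest (time_in_hours, value) pair,
    or omits the name when no measurement exists.
    """
    hr = obs.get("heart_rate")
    temp = obs.get("temperature_c")
    rr = obs.get("resp_rate")
    paco2 = obs.get("paco2")
    wbc = obs.get("wbc")  # in thousands per microliter
    conditions = [
        is_recent(hr, now, 2) and hr[1] > 90,
        is_recent(temp, now, 8) and (temp[1] > 38 or temp[1] < 36),
        (is_recent(rr, now, 2) and rr[1] > 20)
        or (is_recent(paco2, now, 8) and paco2[1] < 32),
        is_recent(wbc, now, 8) and (wbc[1] > 12.0 or wbc[1] < 4.0),
    ]
    return sum(conditions)

def meets_sirs(obs, now):
    """SIRS is met when at least two of the four conditions hold."""
    return sirs_conditions(obs, now) >= 2
```

Stale measurements never satisfy a condition here, which is one reason a guideline of this form leaves many time points ungraded.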


3.3 Learning DSS for sepsis in ICU patients 

We begin with an overview of the experiments. In our first experiment, we assess the quality 
of the trained L-DSS and NL-DSS scores by their performance on the task of distinguishing 
between the different severity stages of sepsis. This is done by calculating severity order¬ 
ing of held out pairs of feature vectors and measuring their concordance with the ground 
truth provided by the SSCG. We show that our scores significantly outperform 
the three main scoring systems that are widely used in ICUs. The success of our scores in 
distinguishing between sepsis stages is encouraging, but expected, since our scores are 
explicitly trained for severity ordering. In the following experiments, we evaluate whether the 
learned scores also generalize well to measuring fine-grained changes in severity. Towards 
this, we first examine whether DSS is sensitive to changes in severity state leading up to 
adverse events. We consider septic shock, an adverse event of sepsis, and measure whether 
the learned severity scores increase leading up to septic shock as one would expect. Indeed, 
we show that the L-DSS and NL-DSS scores show a significant upward trend in the time 
period leading up to the adverse event. 

Stage: SIRS
Criteria: At least two out of the following four conditions hold: 
  1. Heart rate is > 90 beats per minute and was measured in the last 2 hours. 
  2. Temperature is either > 38°C or < 36°C and was measured in the last 8 hours. 
  3. Respiratory rate is > 20 breaths per minute and was measured in the last 2 hours, or 
     arterial partial pressure of CO2 is < 32 mm Hg and was measured in the last 8 hours. 
  4. White blood cell count in thousands per microliter is either > 12.0 or < 4.0 and 
     was measured in the last 8 hours. 

Stage: Severe Sepsis
Criteria: The patient's clinical record contains the words sepsis or septic or an ICD-9 code for 
infection, the SIRS criterion holds, and at least one of the following nine criteria holds: 
  1. Systolic blood pressure is < 90 mm Hg and was measured in the last 2 hours. 
  2. Blood lactate measurement is > 2.0 micromol per liter and was taken in the last 2 hours. 
  3. Urine output over the past two hours is < 0.5 milliliter per kg. 
  4. Patient has no chronic renal insufficiency and her or his creatinine measurement is > 2.0 
     milligrams per deciliter, and was taken in the last 8 hours. 
  5. Patient has no chronic liver disease, her or his bilirubin measurement is > 2.0 
     milligrams per deciliter and was taken in the last 8 hours. 
  6. Platelet count is < 100,000 per microliter and was measured in the last 8 hours. 
  7. International normalized ratio (INR) is > 1.5 and was measured in the last 8 hours. 
  8. Patient experienced pneumonia during their hospital stay as indicated by ICD-9 codes, 
     measurements of both partial arterial pressure of oxygen (PaO2) and of fraction of inspired 
     oxygen (FiO2) were taken in the last 8 hours, and it holds that PaO2 / FiO2 < 200. 
  9. Patient experienced acute lung infection unrelated to pneumonia during their hospital stay 
     as indicated by ICD-9 codes, the measurements of PaO2 and FiO2 were taken 
     in the last 8 hours, and it holds that PaO2 / FiO2 < 250. 

Stage: Septic Shock
Criteria: The following two conditions hold: 
  1. The patient has severe sepsis. 
  2. The patient experiences hypotension (i.e., systolic blood pressure < 90 mm Hg) 
     for at least the last 30 minutes. 

Stage: None
Criteria: The following two conditions hold: 
  1. Heart rate was measured in the last 2 hours, temperature was measured in the last 
     8 hours, respiratory rate was measured in the last 2 hours, arterial partial pressure 
     of CO2 was measured in the last 8 hours, and white blood cell count was measured 
     in the last 8 hours. 
  2. At most one of the SIRS conditions holds. 

Table 2 Definition of the Surviving Sepsis Campaign Guideline (SSCG) for sepsis severity grading. 

Next, we evaluate whether scores trained by L-DSS and NL-DSS are sensitive to changes 
in severity state due to therapy. Specifically, we compare the trend of the disease severity 
score before and after fluid bolus, a therapy used to relieve hypotension in septic patients 
(Dellinger et al, 2013). We show that the learned scores show significant change in their 
trend around the time of treatment administration, thus indicating sensitivity to treatment 
responses. For instance, a DSS score that trends upward over the time period leading up to 
administration of fluid bolus is likely to trend down, or to trend up at a slower pace, during 
the time period after treatment administration. 

Motivated by the results showing sensitivity to impending adverse events, in the last 
experiment, we measure the performance of the learned severity scores for early detection 
of septic shock. We train a simple classifier using the DSS and its trend features to predict 
risk of septic shock onset in the next 48 hours. We show that this predictor significantly 
outperforms routinely used clinical scoring systems. 

The rest of this section proceeds as follows. In Section 3.3.1 we describe our experi¬ 
mental setup and the procedure for the automated generation of clinical comparison pairs. 
Section 3.3.2 presents the baseline methods. Finally, in Section 3.3.3 we present the numer¬ 
ical results and analysis. 

3.3.1 Experimental Setup and Automatic Generation of the Clinical Comparison Pairs 

We begin by randomly dividing the 16,234 patients in our dataset into training (60%) and 
testing (40%) sets. Within the training set, we assign two thirds of the patients to the 
development set and the remaining third to the validation set. For each of the development, 
validation and testing sets of patients we generate a separate set O of clinical comparison pairs. 
We consider six combinations of possible pairs of different sepsis stages, i.e., none-SIRS, 
none-severe, none-shock, SIRS-severe, SIRS-shock, severe-shock. For each combination of 
stages, we randomly select an equal number of pairs of feature vectors $(x_i^p, x_j^q)$ sampled at time 
points $(t_i^p, t_j^q)$ such that $x_i^p$ corresponds to the more severe stage and $x_j^q$ corresponds to the less 
severe stage in this combination. For the development and testing sets we sample 2000 clinical 
comparisons for each combination of sepsis severity stages, resulting in a total of 12000 
clinical comparisons for each set. For the validation set, which contains only half of the number 
of patients in the development and testing sets, we sample 1000 clinical comparisons for 
each combination of sepsis severity stages, resulting in a total of 6000 clinical comparisons. 

3.3.2 Baselines: Routinely Used Clinical Severity Scores in the ICU 

We compare the performance of the learned disease severity scores to three widely used 
ICU-based severity scoring systems (Keegan et al, 2011). The first two scores are based on 
the Sequential Organ Failure Assessment or the SOFA score (Vincent et al, 1996) which 
was originally designed to assess sepsis-related organ damage severity. The SOFA method 
scores severity at the per-organ level. Two variants of the SOFA that are commonly used 
are: 1) total SOFA, computed as the sum of SOFA scores of all organ systems, and 2) worst 
SOFA, represented as the highest value of SOFA score among all organ systems. We also 
compare the performance of our score to that of the Acute Physiology and Chronic Health 
Evaluation or APACHE II (Knaus et al, 1985), which is a widely used scoring system for 
assessing general (not necessarily sepsis-related) disease severity in hospitalized individuals. 
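
The two SOFA variants differ only in how the per-organ scores are aggregated. A minimal sketch, assuming the per-organ scores are already available in a dictionary whose keys are illustrative names:

```python
def total_sofa(organ_scores):
    """Total SOFA: the sum of the per-organ SOFA scores."""
    return sum(organ_scores.values())

def worst_sofa(organ_scores):
    """Worst SOFA: the highest per-organ SOFA score."""
    return max(organ_scores.values())

# Illustrative per-organ scores; the key names are assumptions.
example = {"respiratory": 2, "coagulation": 0, "liver": 1,
           "cardiovascular": 3, "cns": 0, "renal": 1}
```

For `example`, the total is 7 while the worst single-organ score is 3, so the two variants can rank patients differently.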

3.3.3 Performance Evaluation of Scores Trained Using the L-DSS and NL-DSS Algorithms 

In this section, we present numerical results of evaluation of the learned L-DSS and NL- 
DSS scores. 

Selection of free parameters. The L-DSS and NL-DSS algorithms contain free parameters 
$\lambda_O$ and $\lambda_S$ that remain to be specified. Let us first consider the L-DSS algorithm. With 
$\lambda_S$ set to 0, we sweep the values of $\lambda_O$ from 1 to 10^^ and select the value of $\lambda_O$ that 
maximizes accuracy of ordering held out pairs on the validation set. That is, we count the fraction 
of ordering pairs in the set O that are concordant with the ordering prescribed by the ground 
truth comparisons. We refer to this quantity as the severity ordering accuracy or SOA. In the 
evaluations below, $\lambda_O$ is thus set to 10^. 
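
The severity ordering accuracy and the parameter sweep can be sketched as follows. The `train` callback, which is assumed to return a scoring function for a given parameter value, is hypothetical and stands in for fitting L-DSS or NL-DSS; ties are counted as discordant in this sketch.

```python
def severity_ordering_accuracy(score, pairs):
    """Fraction of (more_severe, less_severe) pairs ordered correctly by `score`."""
    if not pairs:
        raise ValueError("at least one comparison pair is required")
    concordant = sum(1 for hi, lo in pairs if score(hi) > score(lo))
    return concordant / len(pairs)

def select_parameter(candidates, train, validation_pairs):
    """Return the candidate whose trained score maximizes validation SOA.

    `train(value)` is a hypothetical callback returning a scoring function.
    """
    return max(
        candidates,
        key=lambda value: severity_ordering_accuracy(train(value), validation_pairs),
    )
```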

For a given $\lambda_O$, different choices of $\lambda_S$ yield trajectories with differing levels of 
smoothness. An appropriate choice of $\lambda_S$ can be made by sweeping through a wide range of values 
and selecting the value that optimizes performance for a use case that the end user has in 
mind. For example, if the primary application of the score is for the early detection of 
individuals at risk for septic shock, the $\lambda_S$ that maximizes prediction accuracy on the validation 
set is selected. Alternatively, when no assumptions are given, $\lambda_S$ can be selected to maximize 
smoothness without hurting ordering accuracy on the validation set. We present results 
using these two approaches for setting $\lambda_S$; we refer to these settings as $\lambda_S^{\mathrm{pr}}$ and $\lambda_S^{\mathrm{sm}}$, 
respectively. In Figure 4, we show performance on the validation set and mark the selected setting 
of $\lambda_S$ for each of the scores. Thus, $\lambda_S^{\mathrm{pr}}$ and $\lambda_S^{\mathrm{sm}}$ were set to be 1.62 • 10® and 1.13 • 10® for 
L-DSS and 2000 and 100 for NL-DSS. While we do not experiment with this approach, it 
is worth noting that yet another means for selecting $\lambda_S$ is based on an expert’s knowledge of 
the degree of short-term to long-term variability expected within that disease domain. For 
example, in slowly evolving diseases, the value of $\lambda_S$ that yields a small ratio of short-term 
to long-term variability may be preferable. 







Fig. 4 Experiment 1. Severity ordering accuracy and the shock prediction AUC on the validation set for 
different values of $\lambda_S$. We mark the values of $\lambda_S$ selected for further evaluations, i.e., $\lambda_S^{\mathrm{sm}}$ and 
$\lambda_S^{\mathrm{pr}}$, with vertical lines. 


Experiment 1: Distinguishing between the severity stages of sepsis. We begin by evalu¬ 
ating whether L-DSS and NL-DSS can distinguish and correctly order the different stages of 
sepsis severity. We compare their severity ordering accuracy to that of routinely used clinical 
scores — APACHE-II, Total SOFA and Worst SOFA (Keegan et al, 2011). The results of 
this evaluation are presented in Table 3. 

We observe that L-DSS and NL-DSS significantly outperform APACHE II, Total SOFA 
and Worst SOFA scores for all considered values of $\lambda_S$. The performance achieved by 
L-DSS and NL-DSS is significant from a clinical standpoint as they order severity states more 
accurately than the three clinical scores, all of which are widely used to assess severity of 
ICU patients. In particular, the SOFA score was designed specifically to measure 
sepsis-related severity. 

Group                     Method        λ_S = λ_S^sm    λ_S = λ_S^pr
Proposed scores           L-DSS         0.860           0.844
                          NL-DSS        0.946           0.938
Routine clinical scores   APACHE II              0.68
                          Total SOFA             0.63
                          Worst SOFA             0.63

Table 3 Experiment 1. Severity ordering accuracy (SOA) for different methods. The 95% confidence interval 
on SOA is obtained using the bootstrap algorithm and is contained in the ±0.002 band around SOA. 

While the above-mentioned result is promising, it remains to be seen whether the ob¬ 
tained scores are also sensitive enough to capture changes in severity status that extend 
beyond the coarse grading that the different stage definitions provide. Towards this, we next 
evaluate whether the learned scores exhibit the following desirable characteristics: 1) Are 
they sensitive to changes in severity leading up to septic shock, an adverse event of sepsis? 
and, 2) Are they sensitive to post-therapy changes in severity? 

Experiment 2: Are the learned DSS sensitive to changes in severity leading up to 
adverse events? To address the question of whether the learned scores are sensitive enough to 
capture changes in severity that can occur leading up to an adverse event, we examine the 
L-DSS and NL-DSS behavior in the 18-hour period leading up to septic shock. 

We consider all patients with septic shock in our test set with at least 18 hours of data 
prior to septic shock onset (N = 587). On these patients, we define three time intervals of 
interest: 1) 0-6 hours prior to the onset of septic shock; 2) 6-12 hours prior to the onset of 
septic shock; 3) 12-18 hours prior to the onset of septic shock. We denote the average 
values of the learned scores in these intervals by $s_{0\text{-}6}$, $s_{6\text{-}12}$, and $s_{12\text{-}18}$, respectively. 

We calculate values of $\Delta_1 = s_{0\text{-}6} - s_{6\text{-}12}$ and $\Delta_2 = (s_{0\text{-}6} - s_{6\text{-}12}) - (s_{6\text{-}12} - s_{12\text{-}18})$ 
for each patient. In Figure 5 (a) we show the full probability density of $\Delta_1$ and 
$\Delta_2$. The value of $\Delta_1$ is positive in at least 70% of the cases for all four considered scores. 
The value of $\Delta_2$ is positive in at least 57% of the cases. Using the standard one-tailed t-test, 
we assess the p-value (denoted by $p_{\text{trend-up}}$ in Table 4) for whether the recorded $\Delta_1$ can be 
observed by chance under the null hypothesis that $\Delta_1$ are drawn from a zero mean 
distribution. Similarly, we assess the p-value (denoted by $p_{\text{rate acceleration}}$ in Table 4) for whether the 
recorded $\Delta_2$ can be observed by chance under the null hypothesis that $\Delta_2$ are drawn from 
a zero mean distribution. Across all values of $\lambda_S$, for both the L-DSS and the NL-DSS, the 
obtained p-values rule out the null hypothesis, that is, the learned scores leading up to septic 
shock show significant upward trend and acceleration. Using the bootstrap, we estimate the 
median p-value for a range of sample sizes and find that significance is achieved (i.e., the median 
p-value for that sample size is below 0.01) with as few as 30 samples for $\Delta_1$ and 420 for 
$\Delta_2$. As an example, in Figure 5(b), we show the L-DSS and NL-DSS trajectories for two 
patients for the period leading up to septic shock. 
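
The per-patient trend statistics and the test statistic behind the p-values can be sketched as below. The function names are assumptions; the sketch computes only the one-sample t statistic, and the one-tailed p-value would then come from the upper tail of a Student t distribution with n - 1 degrees of freedom (for instance via a statistics library).

```python
import math

def trend_statistics(s_0_6, s_6_12, s_12_18):
    """Return (delta_1, delta_2) from the three interval averages.

    delta_1 measures the upward trend over the last 12 hours; delta_2
    measures whether the rate of increase is itself increasing.
    """
    delta_1 = s_0_6 - s_6_12
    delta_2 = (s_0_6 - s_6_12) - (s_6_12 - s_12_18)
    return delta_1, delta_2

def one_sample_t(values):
    """One-sample t statistic for the null hypothesis of zero mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean / math.sqrt(variance / n)
```

A large positive statistic on the collected `delta_1` values corresponds to the upward trend reported in the text, and likewise for `delta_2` and acceleration.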

Quantity                              Method    λ_S = λ_S^sm        λ_S = λ_S^pr
p_trend-up                            L-DSS     [illegible]         10^-39
                                      NL-DSS    10^-39              [illegible]
p_rate acceleration                   L-DSS     1.3 · 10^-5         [illegible]
                                      NL-DSS    1.3 · 10^-3         2.6 · 10^-3
fraction of positive Δ_1 (95% CI)     L-DSS     0.71 (0.68-0.75)    0.70 (0.67-0.74)
                                      NL-DSS    0.74 (0.70-0.77)    0.75 (0.71-0.78)
fraction of positive Δ_2 (95% CI)     L-DSS     0.58 (0.54-0.62)    0.57 (0.53-0.61)
                                      NL-DSS    0.59 (0.56-0.63)    0.58 (0.55-0.63)

Table 4 Experiment 2. p-value p_trend-up for the observed Δ_1; p-value p_rate acceleration for the observed Δ_2; the 
fraction of positive Δ_1 with 95% confidence interval, and the fraction of positive Δ_2 with 95% confidence 
interval. Calculations of all values are based on 587 examples. 

Fig. 5 Experiment 2. Sensitivity of learned DSS to changes in severity leading up to adverse events. (a) 
Probability density of Δ_1 and Δ_2 for L-DSS and NL-DSS; (b) DSS trajectories over the 18-hour period 
leading up to septic shock for two example patients (Patient #1 and Patient #2). 

Experiment 3: Are the learned DSS sensitive to post-therapy changes in severity? We 
now evaluate whether the scores trained by the L-DSS and NL-DSS methods are sensitive 
to changes in severity state due to administration of fluid bolus, a treatment used for septic 
shock (Dellinger et al, 2013). Towards this, we use the self-controlled case series method. 
We compare trends exhibited by DSS values over the five-hour intervals prior to and post 
the administration of fluid bolus. We refer to the trends over these intervals as $\Delta_{\text{prior}}$ and 
$\Delta_{\text{post}}$. The value of $\Delta_{\text{prior}}$ is computed as the difference between the value of the DSS at 
the time of treatment administration and the mean value of DSS over the five-hour interval 
prior to treatment administration. Similarly, the value of $\Delta_{\text{post}}$ is calculated as the difference 
between the mean value of DSS over the five-hour interval after treatment administration 
and the value of DSS at the moment of treatment administration. If the patient is responsive 
to fluid therapy, then $\Delta_{\text{treat}} = \Delta_{\text{post}} - \Delta_{\text{prior}} < 0$, that is, if the DSS was trending up prior 
to treatment administration, we expect this trend to be attenuated or even reversed by the 
treatment. 
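
The trend quantities for a single fluid bolus event can be computed as in the sketch below. It assumes the treatment time coincides with one of the sampled time points and that the score at the treatment time itself is excluded from both five-hour means; neither detail is stated in the text.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def treatment_response(times, scores, t_treat, window=5.0):
    """Return (delta_prior, delta_post, delta_treat) around a treatment time.

    `times` and `scores` describe one patient's DSS trajectory, with times
    in hours. `t_treat` is assumed to be one of the sampled times.
    """
    s_treat = scores[times.index(t_treat)]
    before = [s for t, s in zip(times, scores) if t_treat - window <= t < t_treat]
    after = [s for t, s in zip(times, scores) if t_treat < t <= t_treat + window]
    delta_prior = s_treat - mean(before)
    delta_post = mean(after) - s_treat
    return delta_prior, delta_post, delta_post - delta_prior
```

A negative third value indicates that the upward trend before treatment was attenuated or reversed afterwards.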

We identify cases of fluid administration events related to sepsis using the following 
criteria: 1) the patient is experiencing SIRS, severe sepsis or septic shock at the time of 
treatment administration, and 2) the patient is hypotensive (has systolic blood pressure below 
100 mm Hg), a commonly used criterion for prescribing fluids in sepsis. To avoid confounding 
due to multiple administration of fluids, we restrict our attention to treatment administrations 
that were not preceded or followed by another fluid bolus administration within a five hour 
window. This yielded a total of 81 fluid bolus administration events. 

In Figure 6 (a) we plot the distribution of $\Delta_{\text{treat}} = \Delta_{\text{post}} - \Delta_{\text{prior}}$. Overall, the change of 
trend $\Delta_{\text{treat}}$ is negative in at least 75% of the recorded values. Employing the one-tailed 
t-test, we obtain the p-value $p_{\text{treatment response}}$ (shown in Table 5) for whether the observed 
values of $\Delta_{\text{treat}}$ can be observed by chance under the null hypothesis that $\Delta_{\text{treat}}$ 
are drawn from a zero mean distribution. For our sample size of 81 cases, across all values 
of $\lambda_S$ for both the L-DSS and NL-DSS, the obtained p-values rule out the null hypothesis in 
favor of the stated hypothesis, that is, DSS shows significant response to therapy. Moreover, 
using the bootstrap to estimate the median p-value for a range of sample sizes, we observe 
that significance is achieved with as few as 20 samples. In Figure 6 (b), we show the L-DSS 
and NL-DSS trajectories for two example patients around the time of fluid bolus adminis¬ 
tration. 


Quantity                                  Method    λ_S = λ_S^sm             λ_S = λ_S^pr
p_treatment response                      L-DSS     5 · 10^-[illegible]      2 · 10^-[illegible]
                                          NL-DSS    5.1 · 10^-[illegible]    [illegible]
fraction of negative Δ_treat (95% CI)     L-DSS     0.84 (0.77-0.93)         0.77 (0.68-0.85)
                                          NL-DSS    0.75 (0.67-0.85)         0.84 (0.77-0.93)

Table 5 Experiment 3. Statistical significance of DSS response to fluid bolus treatment. p-value 
p_treatment response for the observed Δ_treat; the fraction of negative Δ_treat with 95% confidence interval. 
Calculations of all values are based on 81 examples. 

Predictive Score for Septic Shock Using DSS: The high ordering performance in experi¬ 
ment 1 and the significant upward trend observed in experiment 2 suggest that a score such 
as DSS may be useful for detecting individuals at risk for septic shock. Thus, we conclude 
this section by showing that the value of the severity score and its temporal trajectory can 
be used in prediction tasks, specifically, in the task of early detection of septic shock. 

We begin by describing the trend features that we derive from sequences of instantaneous 
values of severity scores. For a patient $p$, we let $s_i^p$ for $i = 1, \dots, T_p$ be the value of the 
assigned severity score at time $t_i^p$. At every time point $t_i^p$, we augment the score value $s_i^p$ 
with the seven derived trend features that were inspired by Wiens et al (2012) and adapted to 
the specifics of the MIMIC-II dataset. We present the complete list of these features in Table 
6. In this table, Feature 1 is the average value of the score since admission. We note that the 
feature vectors are not sampled at a fixed rate, i.e., the length of the time interval between 
two consecutive feature vectors of a patient need not be fixed. We thus weigh every value of 
the severity score with the length of the time interval between the current feature vector and 
the previous one. Features 2 and 3 are versions of Feature 1 where more recent scores have a 
higher relative weight. Feature 4 captures the average rate of change in severity score since 
admission. Feature 5 is a version of Feature 4 in which more recent rates of score change are 
given higher relative weight. Finally, Features 6 and 7 capture the variability of the score. 
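
One way to realize the seven trend features described above is sketched below. The exact expressions are not recoverable here, so the weighting schemes are assumptions: each score is weighted by the length of the preceding time interval, and "linear" and "quadratic" weights grow with the sample index.

```python
def weighted_mean(values, weights):
    """Weighted average of `values` with non-negative `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def trend_features(times, scores):
    """Seven trend features at the latest time point (needs >= 2 samples).

    Assumed weights: interval length dt for averages, sample index j for
    linear weighting, and j squared for quadratic weighting.
    """
    steps = range(1, len(times))
    dt = [times[j] - times[j - 1] for j in steps]
    s = [scores[j] for j in steps]
    rate = [(scores[j] - scores[j - 1]) / d for j, d in zip(steps, dt)]
    abs_rate = [abs(r) for r in rate]
    lin = [float(j) for j in steps]
    ones = [1.0] * len(rate)
    return [
        weighted_mean(s, dt),                                       # 1: average score
        weighted_mean(s, [d * w for d, w in zip(dt, lin)]),         # 2: linear weighted
        weighted_mean(s, [d * w * w for d, w in zip(dt, lin)]),     # 3: quadratic weighted
        weighted_mean(rate, ones),                                  # 4: average change rate
        weighted_mean(rate, lin),                                   # 5: linearly weighted rate
        weighted_mean(abs_rate, ones),                              # 6: average absolute rate
        weighted_mean(abs_rate, lin),                               # 7: linearly weighted abs. rate
    ]
```

The returned list would be appended to the instantaneous score value to form the input of the shock predictor.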
Fig. 6 Experiment 3. Sensitivity of DSS to post-treatment changes in severity. (a) Probability density of 
Δ_treat; (b) L-DSS and NL-DSS trajectories before and after administration of fluid bolus for two example 
patients (Patient #1 and Patient #2), plotted against time since hospitalization (hours). 


To train a predictive score for the task of early identification of sepsis, we use logistic 
regression to learn a mapping from a patient’s feature vectors to the probability of occur¬ 
rence of septic shock in the following 48 hours. This approach was inspired by (Clermont 
et al, 2001; Minne et al, 2008; Ho et al, 2012). Specifically, we use logistic regression reg¬ 
ularized with an elastic net penalty (Zou and Hastie, 2005). As positive training examples, 
we consider feature vectors taken less than 48 hours prior to an adverse event. As neg¬ 
ative training examples, we take feature vectors from patients that do not experience the 
adverse event during their hospital stay and do not receive any treatment of fluid-bolus. The 
choice of leaving out individuals who receive fluid-bolus but do not experience septic shock 
is owing to the fact that their outcome is censored due to treatment (Paxton et al, 2013). We 
refer to this procedure as LR-Shock. Employing LR-Shock with severity score and its trend 
features as its input, we obtain a predictor of the onset of septic shock in the next 48 hours. 
We refer to predictors based on the L-DSS and NL-DSS scores and their derived features 
as LR-Shock+L-DSS+Derived and LR-Shock+NL-DSS+Derived, respectively. As a baseline 
for comparison, we train LR-Shock+x, which is a predictor trained by LR-Shock with 
feature vectors x as its input. 

The performance of these predictors is measured in terms of per-patient prediction ac¬ 
curacy that is determined in the following way. Consider an arbitrary value of a threshold r. 
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Feature 

Description 

Expression 

1 

Average score since admission 


2 

Linear weighted score 


3 

Quadratic weighted score 


4 

Average score change rate 


5 

Linearly weighted average score change rate 


6 

Average absolute score change rate 

E,=l ^i-i • tf-4 

7 

Linearly weighted absolute score change rate 

.P r.P J J-1 

2^j=i ^j-i tP-t{ 


Table 6 Trend features derived from the trajectory of the severity score. 

We say that a patient was correctly identified as having septic shock if his or her severity 
score exceeded r at least once prior to the onset of septic shock. We say that a patient 
was falsely identified as having septic shock if he or she did not experience septic shock during 
the hospital stay, but his or her severity score rose above r at some time point. By considering a 
series of values of r, we obtain the receiver operating characteristic (ROC) curve that corresponds 
to our predictor and the associated area under the curve (AUC). In our experiment, we use the AUC on 
the validation set of patients to select the free parameters of the LR-Shock 
method. We report the performance of our classifiers in terms of AUC on the test set (see 
Table 7). This table also contains the predictive performance of the L-DSS and NL-DSS 
scores. 
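
The per-patient evaluation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the experiments: the patient tuple layout, the function names, and the toy trajectories are assumptions made for the example. Each patient is summarized by the highest score observed (only before onset for patients who developed septic shock), the threshold r is swept to trace the ROC curve, and the AUC is approximated with the trapezoidal rule.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical layout: (times, scores, onset). onset is the septic-shock
# onset time, or None if the patient never developed septic shock.
Patient = Tuple[Sequence[float], Sequence[float], Optional[float]]


def peak_score(times, scores, onset):
    """Highest score; for shock patients only times strictly before onset count."""
    eligible = [s for t, s in zip(times, scores) if onset is None or t < onset]
    return max(eligible) if eligible else float("-inf")


def roc_points(patients: List[Patient], thresholds: Sequence[float]):
    """Return one (false positive rate, true positive rate) point per threshold r."""
    pos = [peak_score(*p) for p in patients if p[2] is not None]
    neg = [peak_score(*p) for p in patients if p[2] is None]
    points = []
    for r in thresholds:
        tpr = sum(s > r for s in pos) / len(pos)  # correctly identified
        fpr = sum(s > r for s in neg) / len(neg)  # falsely identified
        points.append((fpr, tpr))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve, anchored at (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Toy trajectories (invented for illustration only).
toy_patients = [
    ([0, 1, 2, 3], [0.1, 0.4, 0.9, 0.95], 2.5),
    ([0, 1, 2], [0.2, 0.3, 0.8], 1.5),
    ([0, 1, 2], [0.1, 0.2, 0.5], None),
    ([0, 1], [0.05, 0.1], None),
]
toy_auc = auc_trapezoid(roc_points(toy_patients, [0.0, 0.2, 0.4, 0.6, 1.0]))
```

With a coarse grid of thresholds the trapezoidal value is only an approximation; using every distinct per-patient peak score as a threshold gives the exact empirical AUC.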

The learned scores are significantly more accurate than the APACHE II and the SOFA-based 
scores. More interestingly, we note that DSS performance is comparable to that of 
LR-Shock+x, which, unlike the DSS scores, was trained to optimize predictive performance. 
Thus, our learning objective addresses the bias introduced by treatment-related confounding 
without hurting predictive performance in this application. 



                                   Features             As - Aj,               As =

Predictors based on proposed       L-DSS                0.836 (0.824-0.849)    0.853 (0.841-0.865)
scores                             NL-DSS               0.859 (0.846-0.872)    0.878 (0.866-0.890)
                                   L-DSS + Derived      0.856 (0.844-0.868)    0.857 (0.845-0.869)
                                   NL-DSS + Derived     0.861 (0.849-0.874)    0.874 (0.862-0.886)

Predictor based on feature         x                    0.864 (0.852-0.875)
vectors alone

Routine clinical scores            APACHE II            0.620 (0.600-0.641)
                                   Total SOFA           0.602 (0.582-0.622)
                                   Worst SOFA           0.601 (0.581-0.621)

Table 7 Per-patient accuracy of early detection of sepsis in terms of the corresponding AUC. The 95% 
confidence interval on the AUC is obtained using the bootstrap method and is given in parentheses. 
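
As an illustration of how a bootstrap confidence interval on the AUC can be obtained, the sketch below resamples patients with replacement and takes percentiles of the resampled AUC values. The text specifies only "the bootstrap method", so the percentile variant, the per-class (stratified) resampling, the number of resamples, and all function names are assumptions of this example.

```python
import random


def auc_mann_whitney(pos, neg):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(pos, neg, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC.

    pos / neg hold one summary score per patient (for example, the peak
    severity score) for patients with and without the adverse event.
    Each class is resampled separately so that no resample is missing a class.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        boot_pos = rng.choices(pos, k=len(pos))
        boot_neg = rng.choices(neg, k=len(neg))
        stats.append(auc_mann_whitney(boot_pos, boot_neg))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy per-patient scores (invented for illustration only).
toy_pos = [0.9, 0.8, 0.7, 0.3, 0.6]
toy_neg = [0.5, 0.1, 0.2, 0.4, 0.65]
toy_ci = bootstrap_auc_ci(toy_pos, toy_neg, n_boot=200, seed=1)
```

Resampling whole patients, rather than individual time points, keeps the interval consistent with the per-patient definition of accuracy used in this section.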


4 Related Work on Ranking 

Our work is closely related to the body of literature on pairwise methods for ranking in 
the field of Information Retrieval (IR). In IR, the ranking problem is typically formulated 
as the task of sorting retrieved documents by their relevance to the query. Pairwise ranking 
approaches (e.g., Joachims (2002); Zheng et al (2008)) aim to learn a ranking function 
that orders pairs of documents in concordance with their relevance. These ranking functions 
are derived using a variety of machine learning techniques. Joachims (2002) proposed a 
max-margin approach to learning a ranking, dubbed SVMRank. A computationally efficient 
method for training SVMRank as a primal-form optimization problem was later proposed 
by Chapelle and Keerthi (2010). Burges et al (2005) proposed RankNet, a neural-network-based 
algorithm for ranking that tunes its parameters to minimize a simple explicit 
probabilistic cost function that captures the task of pairwise ranking. In later work, Burges 
et al (2006) and Burges (2010) observed that a ranking function can be learned from gradients 
of the ranking objective function alone, without explicit specification of the whole 
cost function. They proposed two new ranking algorithms that build on this observation, 
one based on neural networks and one based on gradient-boosted trees. Zheng et al (2008) 
proposed a general boosting framework for learning ranking functions for a wide family 
of cost functions. Additional approaches include FRank (Tsai et al, 2007), nested ranker 
(Matveeva et al, 2006), and multiple hyperplane ranker (Qin et al, 2007). In some disease 
domains, as is the case in ours, one might be able to obtain rank data rather than pairwise 
comparisons. In these cases, ordinal-regression-based approaches have been developed for 
ranking (Herbrich et al, 1999; Chu and Keerthi, 2007). However, acquiring ranked samples 
from the clinician is often not practical. Moreover, by relying on pairwise comparisons, our 
framework opens up the possibility of exploring new forms of supervision that can be automatically 
generated. For example, two time slices with the same severity grade may be 
ordered based on their time to an adverse event in the case when no interventions have been 
administered between these time slices. 
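
To make the idea of pairwise ranking concrete, the following sketch fits a linear scoring function by gradient descent on a pairwise logistic cost, in the spirit of the probabilistic cost used by RankNet. It is not the DSSL objective and not any of the cited implementations: the linear model, the learning rate, the step count, and the toy pairs are assumptions chosen to keep the example small.

```python
import math


def score(w, x):
    """Linear scoring function w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pairwise_logistic_loss(w, pairs):
    """Mean of log(1 + exp(-(score(x_hi) - score(x_lo)))) over ordered pairs.

    Each pair is (x_hi, x_lo), where x_hi should receive the higher score.
    """
    total = 0.0
    for x_hi, x_lo in pairs:
        margin = score(w, x_hi) - score(w, x_lo)
        total += math.log1p(math.exp(-margin))
    return total / len(pairs)


def fit_linear_ranker(pairs, dim, lr=0.5, steps=200):
    """Plain gradient descent on the pairwise logistic loss."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x_hi, x_lo in pairs:
            margin = score(w, x_hi) - score(w, x_lo)
            # Derivative of log(1 + exp(-m)) with respect to m.
            coeff = -1.0 / (1.0 + math.exp(margin))
            for k in range(dim):
                grad[k] += coeff * (x_hi[k] - x_lo[k])
        w = [wk - lr * gk / len(pairs) for wk, gk in zip(w, grad)]
    return w


# Toy ordered pairs (invented): the first vector should outrank the second.
toy_pairs = [
    ([2.0, 0.0], [1.0, 0.0]),
    ([3.0, 1.0], [1.0, 1.0]),
    ([1.5, 0.2], [0.5, 0.9]),
]
toy_w = fit_linear_ranker(toy_pairs, dim=2)
```

Because the loss depends only on score differences within a pair, any source of ordered pairs, whether expert comparisons or automatically generated ones, can be plugged in without changing the training loop.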


5 Discussion and Future Work 

This paper proposes DSSL, a novel ranking-based framework for scalable and automated 
learning of disease severity scores in new disease domains and populations. DSSL only 
requires a means for obtaining clinical comparisons — ordered pairs comparing disease 
severity state at different times. We argue that this form of supervision is more natural to 
elicit than asking clinical experts to map the disease severity score, or to encode an accurate 
model of disease progression. Moreover, supervision of this type can also be obtained in an 
automated way by leveraging existing clinical guidelines. 
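
One way to picture the automated generation of clinical comparisons is to grade each time slice with a coarse, guideline-derived severity level and emit an ordered pair whenever two slices differ in grade. The sketch below assumes a hypothetical integer grade (higher means more severe) and hypothetical slice identifiers; the actual guideline rules are not reproduced here.

```python
from itertools import combinations


def comparison_pairs(slices):
    """Build ordered pairs (more_severe_id, less_severe_id).

    slices: list of (slice_id, grade), where grade is a coarse severity
    level assigned by a guideline-based rule. Slices with equal grades
    produce no pair, since the guideline does not order them.
    """
    pairs = []
    for (id_a, grade_a), (id_b, grade_b) in combinations(slices, 2):
        if grade_a > grade_b:
            pairs.append((id_a, id_b))
        elif grade_b > grade_a:
            pairs.append((id_b, id_a))
    return pairs


# Toy graded time slices (invented for illustration only).
toy_slices = [("t0", 0), ("t1", 2), ("t2", 2), ("t3", 3)]
toy_comparisons = comparison_pairs(toy_slices)
```

The resulting pairs have the same form as comparisons elicited from an expert, so both sources can be pooled into one training set.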

We test DSSL by applying it to a large, real-world electronic health record dataset and 
to synthetic clinical records. Using synthetic clinical records, we show that scores learned 
using DSSL are less sensitive to changes in treatment administration patterns between the 
train and test environments than the regression-based approach that is currently used. 
Using a large real-world dataset of ICU clinical records, we show that the scores learned 
using DSSL are significantly more accurate, both for severity assessment and early adverse-event 
detection, than widely used clinical severity scores. Further, these scores have 
face validity—their behavior aligns with what is expected clinically. They trend upwards 
leading up to an adverse event, and show decline post-treatment. 

DSSL has a number of other advantages. It allows experts to automatically tune the quality 
of the score by increasing the granularity and amount of supervision given. Additionally, 
the quality of the learned scores can be improved by incorporating additional constraints 
related to disease progression. For example, expected clinical response to therapy can be 
directly incorporated as a constraint within the optimization objective. 

One limitation of our current work is the heavy reliance on the availability of a large 
number of clinical comparison pairs.* In domains where existing clinical guidelines cannot 
be leveraged, it is unrealistic to obtain thousands of clinical comparison pairs from experts. 
In these domains, the use of active learning may help mitigate this limitation. Supervision 
in the form of additional constraints related to disease progression may also prove helpful. 
Another aspect that deserves further exploration is how the proposed scores can be made 
interpretable in practice. While NL-DSS yields high performance, the score is constructed 
using a bag of regression trees. Thus, it is not obvious how one might make the score interpretable 
at the point of care. In practice, scores are often deployed with a specific use case in 
mind. For example, if the score were to be used for early detection, the precision-recall curve 
can be used to identify suitable thresholds for taking action based on the DSS score. Another 
approach might be to identify and to display which factors led to the increase or decrease in 
the score value. Finally, using simulated data, our experiments show that the scores learned 
using DSSL are less dependent on the practice patterns of the regime where the model was 
developed. While promising, further analysis is needed to understand susceptibility of DSSL 
to treatment administration patterns. 
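
For a linear score such as L-DSS, one simple way to show which factors drove an increase or decrease is to decompose the score change between two time points into per-feature contributions. The sketch below assumes a score of the form w . x; the feature names, weights, and values are invented for illustration, and the decomposition does not carry over directly to a tree-based score such as NL-DSS.

```python
def score_change_contributions(weights, x_before, x_after, names):
    """Per-feature contribution w_k * (x_after_k - x_before_k) to the change
    in a linear score, sorted by absolute size (largest first)."""
    contributions = [
        (name, w * (after - before))
        for name, w, before, after in zip(names, weights, x_before, x_after)
    ]
    return sorted(contributions, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical features and weights, for illustration only.
toy_names = ["heart_rate", "systolic_bp", "lactate"]
toy_weights = [0.5, -0.2, 1.0]
toy_before = [1.0, 2.0, 0.0]
toy_after = [3.0, 2.0, 0.5]
toy_ranked = score_change_contributions(toy_weights, toy_before, toy_after, toy_names)
```

The contributions sum to the total change in the score, so a display built on them accounts for the whole movement rather than only its largest component.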

In summary, electronic tools that can integrate the large and diverse set of measurements 
collected clinically to produce an accurate, real-time severity score can enable 
clinicians to provide more timely interventions. Further, these scores should be robust to 
changes in clinical practice patterns as the mere introduction of a decision-support tool can 
change clinician behavior. This paper introduces a new ranking-based formulation for the 
problem of learning (predictive) disease severity scores. By leveraging clinical comparisons, 
a form of supervision that is less susceptible to clinician practice patterns, DSSL provides a 
promising alternative to existing methods. 


References 

AHRQ (2015) Guideline syntheses, http://www.guideline.gov/syntheses/index.aspx 

Burges C, Shaked T, Renshaw E, Lazier A, Deeds M, Hamilton N, Hullender G (2005) 
Learning to rank using gradient descent. In: Proceedings of the 22nd International Conference 
on Machine Learning, ACM, pp 89-96 

Burges CJ (2010) From RankNet to LambdaRank to LambdaMART: An overview. Tech. rep., 
Microsoft Research 

Burges CJ, Ragno R, Le QV (2006) Learning to rank with nonsmooth cost functions. In: 
Advances in Neural Information Processing Systems, pp 193-200 

Chapelle O, Keerthi SS (2010) Efficient algorithms for ranking with SVMs. Information 
Retrieval 13(3):201-215 

Chu W, Keerthi SS (2007) Support vector ordinal regression. Neural Computation 
19(3):792-815 

Clermont G, Angus DC, DiRusso SM, Griffin M, Linde-Zwirble WT (2001) Predicting 
hospital mortality for patients in the intensive care unit: a comparison of artificial neural 
networks with logistic regression models. Critical Care Medicine 29(2):291-296 

Dellinger RP, Levy MM, Rhodes A, Annane D, Gerlach H, Opal SM, Sevransky JE, Sprung 
CL, Douglas IS, Jaeschke R, et al (2013) Surviving sepsis campaign: international guidelines 
for management of severe sepsis and septic shock, 2012. Intensive Care Medicine 
39(2):165-228 

* In other experiments (Dyagilev and Saria, 2015), we have shown that the performance of our framework 
degrades gracefully as the number of clinical comparisons is reduced. In particular, on the task of severity 
assessment, scores trained on approximately 150 clinical comparisons are as accurate as APACHE and SOFA. 

Dyagilev K, Saria S (2015) Learning a severity score for sepsis: A novel approach based 
on clinical comparisons. In: AMIA Annual Symposium Proceedings, American Medical 
Informatics Association 

Fine MJ, Auble TE, Yealy DM, Hanusa BH, Weissfeld LA, Singer DE, Coley CM, Marrie 
TJ, Kapoor WN (1997) A prediction rule to identify low-risk patients with community-acquired 
pneumonia. New England Journal of Medicine 336(4):243-250 

Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Annals 
of Statistics pp 1189-1232 

Ghanem-Zoubi NO, Vardi M, Laor A, Weber G, Bitterman H (2011) Assessment of disease-severity 
scoring systems for patients with sepsis in general internal medicine departments. 
Critical Care Medicine 15(2) 

Henry KE, Hager DN, Pronovost PJ, Saria S (2015) A targeted real-time early warning score 
(TREWScore) for septic shock. Science Translational Medicine 7:299-322 

Herbrich R, Graepel T, Obermayer K (1999) Large margin rank boundaries for ordinal regression. 
Advances in neural information processing systems pp 115-132 

Ho JC, Lee CH, Ghosh J (2012) Imputation-enhanced prediction of septic shock in ICU 
patients. In: Proceedings of the ACM SIGKDD Workshop on Health Informatics (HI-KDD12) 

Hothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: A conditional inference 
framework. Journal of Computational and Graphical Statistics 15(3):651-674 

Hug C (2009) Detecting hazardous intensive care patient episodes using real-time mortality 
models. PhD thesis 

Jackson CH, Sharples LD, Thompson SG, Duffy SW, Couto E (2003) Multistate Markov 
models for disease progression with classification error. Journal of the Royal Statistical 
Society: Series D (The Statistician) 52(2):193-209 

Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of 
the eighth ACM SIGKDD international conference on Knowledge Discovery and Data 
mining, ACM, pp 133-142 

Keegan MT, Gajic O, Afessa B (2011) Severity of illness scoring systems in the intensive 
care unit. Critical Care Medicine 39(1):163-169 

Knaus WA, Draper EA, Wagner DP, Zimmerman JE (1985) APACHE II: a severity of disease 
classification system. Critical Care Medicine 13(10):818-829 

Kumar G, Kumar N, Taneja A, Kaleekal T, Tarima S, McGinley E, Jimenez E, Mohan A, 
Khan RA, Whittle J, et al (2011) Nationwide trends of severe sepsis in the 21st century 
(2000-2007). CHEST Journal 140(5):1223-1231 

Kuo TM, Lee CP, Lin CJ (2014) Large-scale kernel RankSVM. In: Proceedings of the 2014 
SIAM International Conference on Data Mining, SIAM 

Marshall JC, Cook DJ, Christou NV, Bernard GR, Sprung CL, Sibbald WJ (1995) Multiple 
organ dysfunction score: a reliable descriptor of a complex clinical outcome. Critical Care 
Medicine 23(10):1638-1652 

Mason L, Baxter J, Bartlett P, Frean M (1999) Boosting algorithms as gradient descent in 
function space. Advances in Neural Information Processing Systems 

Matveeva I, Burges C, Burkard T, Laucius A, Wong L (2006) High accuracy retrieval with 
multiple nested ranker. In: Proceedings of the 29th annual international ACM SIGIR conference 
on Research and Development in Information Retrieval, ACM, pp 437-444 

Medsger T, Bombardieri S, Czirjak L, Scorza R, Rossa A, Bencivelli W (2003) Assessment 
of disease severity and prognosis. Clinical and Experimental Rheumatology 21(3; 
SUPP/29):S42-S46 

Minne L, Abu-Hanna A, de Jonge E, et al (2008) Evaluation of SOFA-based models for 
predicting mortality in the ICU: A systematic review. Critical Care Medicine 12(6):R161 

Mohan A, Chen Z, Weinberger KQ (2011) Web-search ranking with initialized gradient 
boosted regression trees. In: Yahoo! Learning to Rank Challenge, Citeseer, pp 77-89 

Mould D (2012) Models for disease progression: new approaches and uses. Clinical Pharmacology 
& Therapeutics 92(1):125-131 

Paxton C, Niculescu-Mizil A, Saria S (2013) Developing predictive models using electronic 
medical records: Challenges and pitfalls. In: AMIA Annual Symposium Proceedings, 
American Medical Informatics Association, vol 2013, p 1109 

Pirracchio R, Petersen ML, Carone M, Rigon MR, Chevret S, van der Laan MJ (2015) Mortality 
prediction in intensive care units with the super ICU learner algorithm (SICULA): 
a population-based study. The Lancet Respiratory Medicine 3(1):42-52 

Qin T, Zhang XD, Wang DS, Liu TY, Lai W, Li H (2007) Ranking with multiple hyperplanes. 
In: Proceedings of the 30th annual international ACM SIGIR conference on Research 
and Development in Information Retrieval, ACM, pp 279-286 

Saeed M, Lieu C, Raber G, Mark R (2002) MIMIC II: a massive temporal ICU patient 
database to support research in intelligent patient monitoring. In: Computers in Cardiology, 
2002, IEEE, pp 641-644 

Saeed M, Villarroel M, Reisner AT, Clifford G, Lehman LW, Moody G, Heldt T, Kyaw 
TH, Moody B, Mark RG (2011) Multiparameter Intelligent Monitoring in Intensive 
Care II (MIMIC-II): a public-access intensive care unit database. Critical Care Medicine 
39(5):952 

Saria S, Koller D, Penn A (2010a) Learning individual and population level traits from 
clinical temporal data. In: Predictive Models in Personalized Medicine workshop. Neural 
Information Processing Systems 

Saria S, Rajani AK, Gould J, Koller D, Penn AA (2010b) Integration of early physiological 
responses predicts later illness severity in preterm infants. Science Translational Medicine 
2(48) 

Sebat F, Musthafa AA, Johnson D, Kramer AA, Shoffner D, Eliason M, Henry K, Spurlock 
B (2007) Effect of a rapid response system for patients in shock on time to treatment and 
mortality during 5 years. Critical Care Medicine 35(11):2568-2575 

Tsai MF, Liu TY, Qin T, Chen HH, Ma WY (2007) FRank: a ranking method with fidelity 
loss. In: Proceedings of the 30th annual international ACM SIGIR conference on Research 
and Development in Information Retrieval, ACM, pp 383-390 

Vincent JL, Moreno R, Takala J, Willatts S, De Mendonça A, Bruining H, Reinhart C, Suter 
P, Thijs L (1996) The SOFA (Sepsis-related Organ Failure Assessment) score to describe 
organ dysfunction/failure. Intensive Care Medicine 22(7):707-710 

Wang X, Sontag D, Wang F (2014) Unsupervised learning of disease progression models. 
In: Proceedings of the twentieth ACM SIGKDD international conference on Knowledge 
Discovery and Data mining, ACM, pp 85-94 

Wiens J, Horvitz E, Guttag JV (2012) Patient risk stratification for hospital-associated C. 
diff as a time-series classification task. In: Advances in Neural Information Processing 
Systems, pp 467-475 

Zheng Z, Zha H, Zhang T, Chapelle O, Chen R, Sun G (2008) A general boosting method 
and its application to learning ranking functions for web search. In: Advances in Neural 
Information Processing Systems, pp 1697-1704 

Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. Journal of 
the Royal Statistical Society: Series B (Statistical Methodology) 67(2):301-320 





Kirill Dyagilev, Suchi Saria 


A Features used for DSS learning 

In this section we provide the complete list of features provided to the L-DSS and NL-DSS methods for DSS 
learning in the experiments in Section 3.3.3. These can be divided into three categories: clinical information, 
measurements of vital signs, and results of laboratory analysis. 

Clinical information: age of the patient; whether the patient has a pacemaker; whether the patient was diagnosed 
with AIDS; whether the patient received treatment that compromised the immune system; the patient’s current weight 
and weight on admission; presence of ICD-9 codes for diabetes, dialysis, chronic renal insufficiency, heart 
failure, or chronic liver disease; whether the patient is currently in the cardiac surgery recovery unit; presence 
or absence of hematologic malignancy; jaundice; whether the patient was mechanically ventilated; presence of 
metastatic carcinoma. 

Measurements of vital signs: Glasgow coma score; heart rate; Riker Sedation-Agitation Scale; temperature; 
respiratory rate; systolic blood pressure; shock index, defined as the ratio of heart rate to systolic blood 
pressure; peripheral capillary oxygen saturation. 

Results of laboratory analysis: blood urea nitrogen (BUN) levels; hematocrit; international normalized ratio 
(INR); white blood cell count (WBC); blood pH level as measured by an arterial line; partial pressure of arterial 
oxygen (PaO2); fraction of inspired oxygen (FiO2); ratio of PaO2 to FiO2; partial pressure of CO2; 
blood lactate measurements; bilirubin; creatinine; potassium and sodium levels; platelet count; hemoglobin; 
total urine output over the past two hours per kg of weight; partial thromboplastin time; arterial CO2 levels; 
levels of aspartate aminotransferase. 



